Applying Root Cause Analysis in Intelligent Operations and Maintenance

With the spread of big data and cloud computing and the rapid development of distributed and microservice technology, business systems have more and more call levels, and the call relationships between them grow increasingly complex. When a business system fails, the central problem for intelligent operations is how to quickly and accurately locate the root cause within those complex call relationships and then let an automated operations system carry out self-healing. Root cause analysis is both a key topic and a hard one in intelligent operations. Academia and industry have been exploring it for a long time, but apart from a few reported successes in specific scenarios there have been few substantive breakthroughs, and there is still a long way to go before the difficulty is truly overcome.

TELD New Energy Co., Ltd. (特来电, hereinafter "TELD"), a leader in the charging ecosystem industry, has accumulated massive monitoring data during the rapid growth of its business, and it faces the same root cause analysis pain point when handling production faults. The SRE team of the TELD cloud platform has followed the AIOps trend closely, combined machine learning with its operations practice, and through continuous exploration achieved some results in root cause analysis.

This article summarizes our work on the monitoring system, the root cause analysis system, timeout root cause analysis, and exception root cause analysis, in the hope of discussing it with colleagues in the operations community and contributing to the development of root cause analysis.

First, the monitoring system

After years of iterative development, the TELD cloud platform monitoring system has become a comprehensive, multi-dimensional monitoring system built on real-time stream computing, with high performance, high concurrency, and high availability. It serves as the eyes that guard the stable operation of the business systems, constantly watching system status. The logical architecture of the monitoring system is shown below:

(Figure: logical architecture of the monitoring system)

 

Monitoring data comes mainly from three sources:

Metrics: records of data that can be aggregated.

Traces: records of information scoped to a single request, together with the relationships between the calls it makes.

Logs: records of discrete events, such as runtime logs, debug logs, behavior logs, and exception logs.

Root cause analysis is essentially a search for correlation and causation in massive monitoring data, so a comprehensive full-link monitoring system is the foundation for root cause analysis.

A monitoring system has four golden signals:

Traffic: business call volume per unit of time, such as service QPS or orders per second.

Latency: how long a specific piece of business processing takes. The latency of successful calls needs to be distinguished from the latency of failed calls.

Errors: the number of failed calls, the success rate, and the failure rate.

Saturation: the share of its resources that an application has already used.
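The four signals can be illustrated with a small calculation over one time window. The following is only a sketch under stated assumptions: the `Request` record, the `golden_signals` function, and the sample numbers are hypothetical and are not taken from the monitoring system described here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time spent handling the call
    ok: bool            # True if the call succeeded


def golden_signals(requests, window_s, used, capacity):
    """Compute the four golden signals for one time window."""
    total = len(requests)
    ok_durations = [r.duration_ms for r in requests if r.ok]
    failed_durations = [r.duration_ms for r in requests if not r.ok]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "traffic_qps": total / window_s,
        # Latency of successful and failed calls is kept separate.
        "latency_ok_ms": mean(ok_durations),
        "latency_failed_ms": mean(failed_durations),
        "error_rate": len(failed_durations) / total if total else 0.0,
        "saturation": used / capacity,
    }


# Hypothetical sample: four requests observed in a 2-second window.
sample = [Request(20, True), Request(40, True), Request(3000, False), Request(60, True)]
signals = golden_signals(sample, window_s=2, used=6, capacity=8)
```

Keeping the two latency figures apart matters because a single failed call that waited for a timeout would otherwise dominate the average.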

Root cause analysis aims to analyze problems that already exist in the system, so it pays most attention to the latency and error signals. Timed-out and failed traces across the full link are therefore its main concern.

We further subdivide the full link into the technical full link (TTrace) and the business full link (BTrace), which track how a request flows between technical nodes and between business nodes respectively. Efficient trace reconstruction then lays out the intricate call relationships between systems in a clear, orderly way.

The technical full link (TTrace) originates from Google's Dapper paper, and the OpenTracing specification, hosted by the Cloud Native Computing Foundation (CNCF), provides unified concepts and a data standard for traces. TTrace can follow a request across clusters, machines, processes (the distributed service framework HSF, the service gateway SG, the message application center MAC), and middleware (RabbitMQ, Kafka, relational databases, Redis), which makes it a sharp tool for troubleshooting performance problems in cross-application, cross-node, cross-process distributed calls.
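To make the idea of reconstructing a trace concrete, the sketch below rebuilds a call tree from flat span records that carry a parent id, in the spirit of Dapper and OpenTracing. The field names (`span_id`, `parent_id`, `start_ms`, `duration_ms`) and the sample spans are assumptions for illustration, not the actual TTrace data model.

```python
from collections import defaultdict


def build_tree(spans):
    """Group spans by parent; the root is the span without a parent."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children[span["parent_id"]].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the call tree as indented lines, children ordered by start time."""
    lines = ["  " * depth + f'{span["name"]} ({span["duration_ms"]} ms)']
    for child in sorted(children[span["span_id"]], key=lambda s: s["start_ms"]):
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical spans of one request, as they might arrive out of order.
spans = [
    {"span_id": "d", "parent_id": "b", "name": "SQL INSERT", "start_ms": 15, "duration_ms": 3050},
    {"span_id": "a", "parent_id": None, "name": "SG /charge/start", "start_ms": 0, "duration_ms": 3100},
    {"span_id": "c", "parent_id": "b", "name": "Redis GET", "start_ms": 10, "duration_ms": 2},
    {"span_id": "b", "parent_id": "a", "name": "HSF OrderService.create", "start_ms": 5, "duration_ms": 3080},
]
tree_lines = render(*build_tree(spans))
```

Printing `"\n".join(tree_lines)` gives an indented view of the request in which the slow database call is easy to spot.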

 

The business full link (BTrace) takes the business perspective: it monitors the state and health of an entire business process flow and visualizes the end-to-end condition of upstream and downstream business in one place, without switching between multiple business systems, so that problems in business systems can be found and located quickly.

The technical full link and the business full link capture the relationships among all monitored objects and provide strong analytical support for root cause analysis.

Second, root cause analysis system

Based on detection of the four golden signals, root cause analysis focuses on two scenarios, system timeouts and exceptions, which are handled by timeout root cause analysis and exception root cause analysis respectively.

The functional architecture of root cause analysis is shown below:

(Figure: functional architecture of the root cause analysis system)

 

The root cause analysis engine builds on high-performance data-processing middleware such as Kafka and Flink, combined with efficient natural language processing frameworks such as HanLP. It applies machine learning methods, including decision trees, clustering, and classification, to timeouts and exceptions, computes statistics over the analysis results in real time, and sends them to the responsible person as alert messages.

Third, timeout root cause analysis

For timeout root cause analysis of a trace, the decision tree is a relatively efficient machine learning algorithm. A timed-out trace goes through several processing steps: collecting the trace, organizing its nodes, expanding nodes, pruning nodes, and merging redundant rules. Starting from the root node of the trace, the classifier tests the corresponding feature attribute of the item being classified and follows the branch selected by its value until it reaches a leaf node. The category stored at that leaf is the final root cause decision.
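The root-to-leaf walk can be sketched as follows. This is a simplified stand-in with two hand-written rules rather than a learned decision tree: the `dominance` threshold, the two result categories, and the sample trace are all assumptions made for illustration.

```python
def locate_root_cause(node, dominance=0.8):
    """Walk a timed-out trace from its root down to a leaf decision.

    At each node one feature is tested: does a single child account for
    at least `dominance` of the node's own duration? If so, follow that
    branch; otherwise the node itself is where the time was spent.
    """
    while True:
        children = node.get("children", [])
        if not children:
            return node["name"], "slow leaf call"
        slowest = max(children, key=lambda c: c["duration_ms"])
        if slowest["duration_ms"] >= dominance * node["duration_ms"]:
            node = slowest  # the time is spent further downstream
        else:
            return node["name"], "slow in own processing"


# Hypothetical timed-out trace, already reconstructed as a nested tree.
trace = {
    "name": "SG /charge/start", "duration_ms": 3100, "children": [
        {"name": "HSF OrderService.create", "duration_ms": 3080, "children": [
            {"name": "Redis GET", "duration_ms": 2},
            {"name": "SQL INSERT", "duration_ms": 3050},
        ]},
    ],
}
root_cause = locate_root_cause(trace)
```

A production classifier would test more attributes per node (error flags, node type, historical baselines) and would derive its branching rules from data, but the traversal has the same shape.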

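Because sporadic timeouts are common and not every timeout is a fault, we run root cause analysis only after a timeout alert fires, and we ensure that the root cause analysis engine delivers its result within 1 minute of the alert. The result is appended to the preceding alert message, and a layer-by-layer drill-down lets users inspect the analysis process and the corresponding full-link timeout information.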

 

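Timeouts have many possible root causes. A fairly common fault is that some node is blocked, which is serious and needs special attention. However, root cause analysis of a single timed-out trace only reflects the problem on that one trace. When a fault occurs it usually produces a large number of timeouts, and single-trace analysis cannot capture the timeout situation across all traces. The results of timeout root cause analysis therefore need a second round of clustering to find the blocking point at the global level.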
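A minimal sketch of this second pass is shown below. It assumes that each per-trace analysis has already produced the name of a suspect node, and it replaces real clustering with simple grouping by node name; `find_blocking_points`, the `min_share` threshold, and the node names are hypothetical.

```python
from collections import Counter


def find_blocking_points(root_causes, min_share=0.5):
    """Aggregate per-trace root causes and keep nodes blamed by a large share of traces."""
    counts = Counter(root_causes)
    total = sum(counts.values())
    return [(node, n) for node, n in counts.most_common() if n / total >= min_share]


# Hypothetical per-trace results collected during one alert window.
per_trace_results = ["mysql-order"] * 7 + ["redis-cache"] * 2 + ["HSF PayService"]
blocking = find_blocking_points(per_trace_results)
```

Here seven of the ten timed-out traces point at the same node, so it is reported as the likely global blocking point, while the isolated results are treated as noise.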

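Fourth, exception root cause analysis

One important reason exception root cause analysis rarely pays off is that developers wrap exceptions layer upon layer when handling them. As an exception is rethrown outward again and again, much of the valuable information ends up hidden in the innermost layer, so root cause analysis has to peel the layers back one by one, from the outermost exception down to the innermost. It is easy for developers to tie the bell and hard for root cause analysis to untie it, so the results are far from satisfactory.

As the saying goes, whoever tied the bell should untie it. Given how hard layered exception wrapping is to analyze, we start at the source: each middleware layer analyzes the exceptions it catches layer by layer and then reports the innermost exception to the monitoring system through instrumentation, so that a preliminary root cause analysis is already done on the client side.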
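Unwrapping to the innermost exception can be illustrated with Python's exception chaining, where `__cause__` and `__context__` link a wrapper to the exception it wraps. This is a sketch in Python for illustration only; the function names, messages, and the shape of the reported record are assumptions and do not describe the actual middleware instrumentation.

```python
def innermost(exc):
    """Follow the exception chain down to the innermost exception."""
    seen = {id(exc)}
    while True:
        inner = exc.__cause__ if exc.__cause__ is not None else exc.__context__
        if inner is None or id(inner) in seen:  # end of chain, or a cycle
            return exc
        seen.add(id(inner))
        exc = inner


# Hypothetical call stack in which each layer wraps the exception below it.
def query():
    raise TimeoutError("connect to db-01:3306 timed out")


def service():
    try:
        query()
    except TimeoutError as e:
        raise RuntimeError("order creation failed") from e


def gateway():
    try:
        service()
    except RuntimeError as e:
        raise Exception("internal server error") from e


try:
    gateway()
except Exception as outer:
    root = innermost(outer)
    # This record stands in for what would be reported to the monitoring system.
    report = {"type": type(root).__name__, "message": str(root)}
```

The outermost message ("internal server error") says nothing useful, while the reported record carries the timeout and the address it concerns.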

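The exception information reported from the clients to the server is the innermost exception, but its text structure is still fairly messy, so it is processed with NLP and machine learning techniques: the exception text is segmented into words, features are extracted from the segmented text to build a bag-of-words vector model, and the vectors are fed into a clustering algorithm to cluster the exception texts, as shown below:

(Figure: clustering of exception texts)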
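The segmentation, bag-of-words, and clustering steps can be sketched with the standard library alone. The sketch makes several assumptions: a regular expression stands in for a real segmenter such as HanLP, the clustering is a greedy single pass with a cosine-similarity threshold rather than whichever algorithm the engine uses, and the sample messages are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep lowercase word tokens only, so host ids and ports do not split clusters.
    return re.findall(r"[a-z_]+", text.lower())


def bag_of_words(text):
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.7):
    """Assign each text to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [text, ...]}
    for text in texts:
        vec = bag_of_words(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(text)
            best["centroid"].update(vec)  # Counter.update adds the counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters


# Hypothetical innermost exception messages reported by clients.
messages = [
    "Connection timed out: db-01:3306",
    "Connection timed out: db-02:3306",
    "NullPointerException at OrderService.create",
    "Connection timed out: db-07:3306",
]
groups = cluster(messages)
sizes = [len(g["members"]) for g in groups]
```

The three connection timeouts differ only in host name and fall into one cluster, while the unrelated exception forms its own.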

 

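After the innermost exception information is clustered in real time and combined with the other reported exception attributes (client address, target address, source service ID, target service ID, and so on), global exception root cause statistics are laid out in full view for operations staff, as shown below:

(Figure: global exception root cause statistics)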

 

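Clustering exception text is only the first step of exception root cause analysis. A system can raise many exceptions without necessarily having a fault, so a fault signature library needs to be built, and the exception clusters are matched against the identified fault signatures to achieve fault root cause analysis, as shown below:

(Figure: matching exception clusters against the fault signature library)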
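One possible shape for such a signature library is a list of text patterns, each with a minimum cluster size below which exceptions are treated as noise. The sketch below is an assumption about how matching could work: the signature entries, thresholds, and cluster summaries are hypothetical, and the source does not describe the real library's format.

```python
import re

# Hypothetical fault signature library: a pattern plus a minimum cluster size.
FAULT_SIGNATURES = [
    {"fault": "database unreachable",
     "pattern": re.compile(r"connection (timed out|refused).*db", re.I),
     "min_count": 50},
    {"fault": "message queue backlog",
     "pattern": re.compile(r"queue .* full", re.I),
     "min_count": 20},
]


def match_faults(clusters, signatures=FAULT_SIGNATURES):
    """Return (fault, summary, count) for clusters that match a signature.

    `clusters` is a list of (representative message, exception count) pairs.
    """
    faults = []
    for summary, count in clusters:
        for sig in signatures:
            if count >= sig["min_count"] and sig["pattern"].search(summary):
                faults.append((sig["fault"], summary, count))
                break  # first matching signature wins
    return faults


# Hypothetical cluster summaries from one analysis window.
clusters = [
    ("Connection timed out: db-01:3306", 312),
    ("NullPointerException at OrderService.create", 4),
    ("Connection refused: db-02:3306", 3),
]
faults = match_faults(clusters)
```

Only the large cluster is promoted to a fault. The small `Connection refused` cluster matches the text pattern but stays below the size threshold, which reflects the point that exceptions alone do not imply a fault.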

 

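Fifth, application value

As business systems grow more complex, internet companies face ever greater pressure from 7*24 online operations. Traditional operations methods cannot cope with root cause analysis and recovery of online faults, and teams often end up scrambling and helpless amid the chaos. Machine-learning-based root cause analysis for intelligent operations is therefore an essential technical safeguard for every operations engineer.

Our root cause analysis product, built on machine learning and our own operations practice, copes effectively with the operational pressure of massive online monitoring data. It has cut fault localization and analysis time from the previous 30 minutes to 1 minute, greatly improving fault localization and winning valuable time for rapid recovery.

Root cause analysis not only locates faults quickly but also converges alert messages, ensuring that only root cause alerts are sent to operations staff, which greatly improves their efficiency in handling problems.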

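Sixth, development plan

Although root cause analysis can locate the cause of a problem, that cause is sometimes not the origin of the problem. Further origin analysis is needed, namely system change analysis covering patch releases, promotional campaigns, chaos engineering drills, and the like, because it is these changes at the source that trigger latent faults.

Truly intelligent operations require not only strong root cause analysis and fault analysis but also strong automated operations. Feeding root cause and fault analysis results into automated operations to achieve self-discovery, self-analysis, and self-healing of faults is the ultimate dream of every operations engineer.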

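Seventh, summary

A sound real-time monitoring system for massive, highly concurrent data, complete full-link monitoring, and strong root cause analysis are must-have capabilities for monitoring and operations at internet companies. With fault analysis and origin analysis strengthened further and supplemented by automated operations, intelligent operations and even unattended operations are not out of reach.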

Eighth, the TELD Cloud Computing and Big Data WeChat official account

1. WeChat official account name: 特来电云计算与大数据



Origin www.cnblogs.com/liugh/p/12127884.html