Ruixiangyun Tech | On the Alarm Management Capability Maturity Model

     As application runtime environments move to containers and the cloud, and IT system architectures move to microservices, more and more companies have introduced more tools, more complex processes, and more operations staff to manage their IT systems at a finer granularity. But a new problem has cropped up.

Like the butterfly effect, data in such a complex environment is closely linked: a change in a single indicator or a single alarm may trigger a chain reaction. Red markers across different monitoring platforms and a flood of e-mail and SMS alerts keep operations staff constantly on edge. Fine-grained alarm management is imperative.

 

The challenges of alarm management in operations

     How can alarm storms be suppressed? How can we make sure important alarms are not missed? How can root-cause alarms be identified quickly? How can experience in handling alarms be accumulated? How can business operations be restored quickly? These are the hardest problems every operations team faces in its daily work. What exactly causes alarm storms to be so frequent, and what makes alarm management so complex?

Application systems are increasingly tightly coupled

     Completing a single business transaction often spans multiple applications, and a problem in any IT component on an application's call chain is likely to make the transaction fail. In any alarm system, one monitored object can trigger a number of other related alarm policies. The correlation among large volumes of alarms can be as high as 90%, meaning that 90% of all alarms can be traced back to some source alarm.

Alarm policy settings are hard to balance

     If alarm thresholds are set high, operational failures are easily missed; if they are set low, they generate a large number of invalid alarms and hurt the operations team's efficiency. The length of the alarm check cycle poses a similar problem. To avoid missing alarms, operations teams often have to raise alarm sensitivity, and the resulting alarm repetition rate may be as high as 60%.
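
     The threshold trade-off described above can be illustrated with a minimal sketch. The sample data, the function name `count_alarms`, and the "breach for N consecutive checks" rule are illustrative assumptions, not settings taken from any particular monitoring tool.

```python
from typing import List


def count_alarms(samples: List[float], threshold: float, consecutive: int = 1) -> int:
    """Count alarms fired when samples stay above `threshold` for
    `consecutive` checks in a row (one alarm per breach episode)."""
    alarms = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == consecutive:  # fire once per episode
                alarms += 1
        else:
            streak = 0
    return alarms


# Hypothetical CPU utilisation (%) sampled once per check cycle.
cpu = [62, 71, 93, 68, 88, 91, 95, 97, 70, 86, 64]

print(count_alarms(cpu, threshold=85))                 # sensitive: short spikes also alarm
print(count_alarms(cpu, threshold=85, consecutive=3))  # only the sustained breach alarms
print(count_alarms(cpu, threshold=98))                 # too high: the sustained breach is missed
```

     Requiring several consecutive breaches is one common way to trade a little detection delay for fewer repeated alarms; whether it fits depends on how quickly the monitored failure has to be caught.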

Alarms are not responded to promptly

     In most operations teams today, several people are involved in handling the same type of alarm, from as few as 2-3 to as many as 9-10 or more, so the same alarm is pushed to multiple operations staff. During certain periods, however, usually only one person on duty is responsible for handling alarms, so these notifications greatly disturb the personal lives of the other team members. The lack of an efficient dispatch and on-call scheduling mechanism, combined with a large volume of repeated and invalid messages, leads to delayed and missed alarm handling to some extent, which in turn causes alarm storms.
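
     As a rough illustration of schedule-based dispatch, the sketch below sends an alarm only to whoever is on duty rather than to the whole team. The shift table, the engineer names, and the function `on_duty` are hypothetical.

```python
from datetime import datetime, time
from typing import List, Tuple

Shift = Tuple[time, time, str]  # (start, end, engineer); end is exclusive


def on_duty(shifts: List[Shift], now: datetime) -> str:
    """Return the engineer whose shift covers `now`."""
    t = now.time()
    for start, end, engineer in shifts:
        if start <= end:
            if start <= t < end:
                return engineer
        elif t >= start or t < end:  # shift crosses midnight
            return engineer
    raise LookupError("no shift covers this time")


# Hypothetical two-shift rota.
shifts = [
    (time(9, 0), time(18, 0), "alice"),
    (time(18, 0), time(9, 0), "bob"),
]

print(on_duty(shifts, datetime(2019, 8, 13, 10, 0)))  # day shift
print(on_duty(shifts, datetime(2019, 8, 13, 3, 30)))  # night shift
```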

 

The "Alarm Management Capability Maturity Model" emerges

     To improve the efficiency of IT operations management and minimize its difficulty, developing AIOps technology has become an inevitable choice. Alarm management is an important part of AIOps: it connects upward to monitoring tools and downward to ITIL processes and automation platforms, and sits at the central hub of the entire operations monitoring system. The level of alarm management capability has become a key constraint on the SLA (Service-Level Agreement) of IT operations.

     To help companies assess their current alarm management more quantitatively and clarify the goals and evolution path of an alarm management platform, we divide alarm management capability into five levels and combine them into the "Alarm Management Capability Maturity Model". The levels are ordered by management capability and are progressive: each higher level includes the content of the levels below it.

Table: Alarm Management Capability Maturity Model Classification

 

Level 1: Decentralized alarm management

     To cover every aspect of their IT systems as comprehensively as possible, operations teams have had to introduce multiple monitoring tools. Different monitoring tools produce tens of thousands of alarms, which need to be analyzed, prioritized, and acted on. Over time, there may be hundreds of thousands or even millions of alarm events to watch.

     Because alarms are not centrally managed and assigned, alarm information passes between different operations staff in a disorderly way, which makes alarm handling and response inefficient. Strictly speaking, this level is still far from mature management.

Level 2: Unified alarm management

     More and more operations teams have become aware of the high management costs and inefficient troubleshooting caused by this disorder. According to statistics, more than 20% of operations teams manage alarms in a unified way, either on a platform built in-house or on a third-party platform.

     Alarms generated by different monitoring tools or systems are fed into a unified management platform, where they can be de-duplicated, filtered, and compressed according to certain rules. This maturity level breaks down the boundaries between monitoring tools. Alarms are categorized from a service or scenario perspective, following the operations team's functional division of labor (for example, by business line or by IT infrastructure), and combined with more efficient collaboration tools such as DingTalk, WeCom, and Slack, which improves troubleshooting efficiency to a certain extent.
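
     Rule-based de-duplication, filtering, and compression might look like the following minimal sketch. The `Alarm` fields, the `(host, check)` de-duplication key, the 300-second window, and the tool names are assumptions chosen for illustration, not the rules of any specific platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alarm:
    source: str      # monitoring tool that raised the alarm
    host: str
    check: str       # e.g. "cpu", "disk"
    severity: str    # "info", "warning", "critical"
    timestamp: int   # seconds


@dataclass
class Incident:
    first: Alarm
    count: int = 1
    last_seen: int = 0


def compress(alarms: List[Alarm], window: int = 300,
             drop_severities: Tuple[str, ...] = ("info",)) -> List[Incident]:
    """Filter low-severity alarms, then merge alarms that share a
    (host, check) key and arrive within `window` seconds of the previous one."""
    open_incidents: Dict[Tuple[str, str], Incident] = {}
    result: List[Incident] = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if alarm.severity in drop_severities:       # filter rule
            continue
        key = (alarm.host, alarm.check)             # de-duplication key
        incident = open_incidents.get(key)
        if incident is not None and alarm.timestamp - incident.last_seen <= window:
            incident.count += 1                     # compress the repeat
            incident.last_seen = alarm.timestamp
        else:
            incident = Incident(first=alarm, last_seen=alarm.timestamp)
            open_incidents[key] = incident
            result.append(incident)
    return result


# Hypothetical alarms arriving from two different tools.
raw = [
    Alarm("tool-a", "web-1", "cpu", "critical", 0),
    Alarm("tool-b", "web-1", "cpu", "critical", 60),
    Alarm("tool-a", "web-1", "cpu", "critical", 120),
    Alarm("tool-a", "db-1", "disk", "info", 130),
    Alarm("tool-a", "web-1", "cpu", "critical", 1000),
]
incidents = compress(raw)
print([(i.first.host, i.first.check, i.count) for i in incidents])
```

     Here five raw alarms collapse into two incidents: the three CPU alarms that arrive close together are merged, the informational alarm is filtered out, and the late CPU alarm opens a new incident.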

Level 3: Intelligent alarm management

     Business changes and monitoring requirements change, so the problems caused by rigid alarm de-duplication rules are self-evident. Statistical analysis of large amounts of data shows that fewer than 40% of alarms can be compressed by rules.

     With the continued development of artificial intelligence, and in particular the maturing of NLP (Natural Language Processing), applying classification, clustering, and pattern-discovery algorithms to alarms, which are essentially text data, has become the main way to suppress alarm storms and improve alarm effectiveness. Large numbers of similar and related alarms are aggregated by means of temporal correlation, text similarity, fault causality (abduction) graphs, and the CMDB (Configuration Management Database). Important signals such as anomalous or novel alarms are identified through time entropy and content entropy: the more infrequent, irregular, and severe an alarm is, the more attention it needs, and the more important the information, the larger its entropy value. Intelligent alarm management greatly reduces the number of alarms to be handled and makes alarm and fault analysis more efficient.
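
     Aggregating alarms by temporal correlation and text similarity can be sketched as below. This is a deliberately simple stand-in: it uses `difflib.SequenceMatcher` from the Python standard library instead of a real NLP model, and the window, similarity threshold, and alarm texts are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Alarm = Tuple[int, str]  # (timestamp in seconds, alarm text)


def similar(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude proxy for text similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(alarms: List[Alarm], window: int = 120,
            min_ratio: float = 0.7) -> List[List[Alarm]]:
    """Greedy clustering: an alarm joins the first group whose latest member
    is within `window` seconds and whose first text is similar enough."""
    clusters: List[List[Alarm]] = []
    for alarm in sorted(alarms):
        ts, text = alarm
        for group in clusters:
            last_ts = group[-1][0]
            first_text = group[0][1]
            if ts - last_ts <= window and similar(text, first_text) >= min_ratio:
                group.append(alarm)
                break
        else:
            clusters.append([alarm])  # no group matched: start a new one
    return clusters


# Hypothetical alarm stream.
alarms = [
    (0, "Connection timeout to db-1 from web-1"),
    (20, "Connection timeout to db-1 from web-2"),
    (45, "Connection timeout to db-1 from web-3"),
    (50, "Disk usage above 90% on log-1"),
    (4000, "Connection timeout to db-1 from web-1"),
]
clusters = cluster(alarms)
print([len(group) for group in clusters])
```

     The three timeouts that arrive together form one group, the disk alarm stays on its own, and the much later timeout starts a new group because it falls outside the time window.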

Level 4: Root-cause alarm localization

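     Root-cause localization has always been the jewel in the crown of alarm management. Because alarms propagate and are multifaceted, quickly locating the root cause amid a mass of tangled information is a huge challenge for every operations team.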

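     Exploration of root-cause localization falls roughly into three directions. The first performs root-cause analysis based on dynamically acquired system call chains and hosting relationships, combined with temporal correlation. The second uses the CMDB to build a set of (configuration item, relationship) pairs that reflects the system environment in real time, and locates the root cause through how alarms project onto it. The third builds a library of entities, attributes, and relationships that covers the whole domain of IT operations management, and then applies knowledge-graph algorithms to obtain the root-cause alarm. Whichever approach is taken, it must rest on deep study and understanding of the IT system architecture in order to truly tell true from false and see through to the root cause.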
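
     The second direction can be illustrated with a minimal sketch: project the currently alarming configuration items (CIs) onto a dependency graph and keep only those with no alarming dependency underneath them. The relation list, the CI names, and the function `root_cause_candidates` are hypothetical, and the sketch assumes an acyclic "depends on" graph; a real CMDB would be far richer.

```python
from typing import Dict, List, Set, Tuple


def root_cause_candidates(relations: List[Tuple[str, str]],
                          alarming: Set[str]) -> Set[str]:
    """`relations` holds (dependent, dependency) pairs. An alarming CI is a
    root-cause candidate if none of its transitive dependencies is alarming."""
    depends_on: Dict[str, Set[str]] = {}
    for dependent, dependency in relations:
        depends_on.setdefault(dependent, set()).add(dependency)

    def has_alarming_dependency(ci: str) -> bool:
        seen: Set[str] = set()
        stack = list(depends_on.get(ci, ()))
        while stack:                      # depth-first walk down the dependencies
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in alarming:
                return True
            stack.extend(depends_on.get(node, ()))
        return False

    return {ci for ci in alarming if not has_alarming_dependency(ci)}


# Hypothetical CMDB relations: application -> service -> database/cache -> VM.
relations = [
    ("checkout-api", "order-service"),
    ("order-service", "mysql-1"),
    ("order-service", "redis-1"),
    ("mysql-1", "vm-12"),
    ("redis-1", "vm-13"),
]
alarming = {"checkout-api", "order-service", "mysql-1", "vm-12"}
print(root_cause_candidates(relations, alarming))
```

     In this example four alarms reduce to a single candidate, the virtual machine at the bottom of the chain. Combining the result with timestamps, as in the first direction, would help when several candidates remain.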

Level 5: Alarm self-healing

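     Alarm self-healing is a complete automated fault-handling process. By connecting monitoring tools, the alarm platform, the task-scheduling platform, the CMDB, ITIL, and other related systems, it runs from alarm reception through root-cause localization, rule matching, script execution, fault recovery, and manual confirmation to alarm recovery, truly managing the full life cycle of an alarm.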

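     Besides the technical difficulty of root-cause alarm localization from Level 4, the alarm self-healing process has another key element: building an alarm and fault knowledge base. This is the accumulated experience of day-to-day operations work and the foundation of fault recovery plans. Yet it is precisely the weak spot of many companies. A great deal of fault-handling experience exists only in the heads of individual operations staff, and day-to-day troubleshooting and recovery rely mostly on personal ability. As staff move on, these most valuable assets are lost with them, so even a recurring fault has to be analyzed from scratch, which needlessly lengthens recovery time.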
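
     The self-healing flow and the role of the knowledge base can be sketched as follows. Everything here is hypothetical: the `KNOWLEDGE_BASE` mapping, the `restart_service` action, and the callbacks stand in for calls to a real root-cause engine, task-scheduling platform, and confirmation step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    ci: str
    symptom: str
    status: str = "open"
    history: List[str] = field(default_factory=list)


def restart_service(alarm: Alarm) -> bool:
    """Placeholder remediation; a real one would call a task-scheduling platform."""
    return True


# Hypothetical knowledge base: known symptom -> remediation action.
KNOWLEDGE_BASE: Dict[str, Callable[[Alarm], bool]] = {
    "service_down": restart_service,
}


def self_heal(alarm: Alarm,
              locate_root_cause: Callable[[Alarm], str],
              confirm: Callable[[Alarm], bool]) -> Alarm:
    alarm.history.append("received")                        # alarm reception
    root = locate_root_cause(alarm)                         # root-cause localization
    alarm.history.append(f"root_cause:{root}")
    action = KNOWLEDGE_BASE.get(alarm.symptom)              # rule matching
    if action is None:
        alarm.status = "escalated"                          # no known fix: hand to a human
        alarm.history.append("no_rule_matched")
        return alarm
    alarm.history.append(f"rule_matched:{action.__name__}")
    recovered = action(alarm)                               # script execution
    alarm.history.append("script_executed")
    if recovered and confirm(alarm):                        # manual confirmation
        alarm.status = "recovered"                          # alarm recovery
        alarm.history.append("confirmed")
    else:
        alarm.status = "escalated"
    return alarm


healed = self_heal(Alarm("order-service", "service_down"),
                   locate_root_cause=lambda a: a.ci, confirm=lambda a: True)
unknown = self_heal(Alarm("mysql-1", "replication_lag"),
                    locate_root_cause=lambda a: a.ci, confirm=lambda a: True)
print(healed.status, unknown.status)
```

     The second alarm is escalated because the knowledge base has no entry for its symptom, which is exactly the gap that appears when handling experience is never written down.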

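     Alarm self-healing helps operations teams identify the cause of a problem as early as possible and repair faults quickly. It also helps them accumulate problem-handling experience and guard against potential risks, ultimately forming closed-loop management of system operations.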

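     Today, more and more companies are exploring alarm management and have achieved some success in suppressing alarm storms. The intelligent alarm platform from 睿象云 (Ruixiangyun) is also helping operations teams in different industries solve the problems of centralized and intelligent alarm management. The road of operations is long and hard, and continuous improvement of alarming cannot be achieved overnight. We believe that as technology develops and experience accumulates, alarm management will enter a summer of leapfrog development. We also hope that, by discussing and practicing the Alarm Management Capability Maturity Model together, we can all move toward unattended operations, the ultimate goal of operations and maintenance.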

Origin www.cnblogs.com/ruixiangyun/p/11344227.html