On the Alarm Management Capability Maturity Model

As container-based cloud runtimes and microservice architectures reshape IT infrastructure, more and more companies are introducing additional tools, increasingly complex processes, and more operations staff in order to manage their IT systems at a finer granularity. But a new problem has emerged.
In such a complex environment the data are tightly interlinked, and, as with the butterfly effect, a change in a single metric or a single alarm can trigger a chain reaction. Monitoring platforms light up with red markers, and alert e-mails and SMS messages pour in, keeping operations staff permanently on edge. Fine-grained alarm management has become imperative.

The challenges of alarm management in operations

How can alarm storms be suppressed? How can we make sure important alarms are not missed? How can root-cause alarms be identified quickly? How can alarm-handling experience be captured and retained? How can business operations be restored quickly? These are the hardest problems every operations team faces in its daily work. What exactly causes alarm storms to occur so frequently, and what makes alarm management so complex?

Application systems are increasingly interconnected
Completing a single business transaction often requires calls across multiple application systems, and a problem in any IT component along the call chain can cause the transaction to fail. An alarm on any one system or monitored object can trigger many other related alarm policies. Across large alarm volumes the correlation is as high as 90%, meaning that 90% of all alarms can be traced back to a source alarm.

Alarm policy settings are hard to balance
If alarm thresholds are set too high, system failures are easily missed; if they are set too low, they generate a large number of invalid alarms and reduce the operations team's efficiency. The length of the alarm check cycle poses a similar trade-off. To avoid missing alarms, operations teams often have no choice but to raise alarm sensitivity, and as a result the alarm repetition rate can reach 60%.
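The threshold trade-off can be made concrete with a small sketch. Everything here is illustrative rather than taken from any particular monitoring tool: the function name `count_alarms`, the `consecutive` parameter (a stand-in for how many check cycles must breach the threshold before an alarm fires), and the sample CPU readings are all assumptions.

```python
def count_alarms(samples, threshold, consecutive=1):
    """Count how many times an alarm fires on a series of metric samples.

    An alarm fires once per breach episode, at the moment `consecutive`
    samples in a row have exceeded `threshold`. (Hypothetical policy.)
    """
    fired = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == consecutive:
                fired += 1
        else:
            streak = 0
    return fired


# Made-up CPU usage readings (%): two short spikes, then a sustained overload.
samples = [40, 85, 42, 45, 88, 41, 92, 95, 97, 96, 50]

print(count_alarms(samples, threshold=80))                 # 3: spikes also alarm
print(count_alarms(samples, threshold=80, consecutive=3))  # 1: only the overload
print(count_alarms(samples, threshold=98))                 # 0: the overload is missed
```

With a sensitive policy the two transient spikes produce alarms of little value, while an overly high threshold misses the sustained overload entirely.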

Alarm response is not timely
Most operations teams currently have several people handle the same type of alarm, from as few as 2-3 to as many as 9-10 or more, so the same alarm is pushed to multiple operations engineers. In certain periods, however, usually only one person on duty is responsible for handling alarms, and the notifications seriously disturb the personal lives of the other team members. Without an efficient dispatch and on-call scheduling mechanism, and with the large volume of repeated and invalid messages on top, alarm handling is delayed and alarms are missed to some extent, which in turn feeds alarm storms.

The "Alarm Management Capability Maturity Model" takes shape

To improve the efficiency of IT operations management and reduce its difficulty as far as possible, the move toward AIOps has become inevitable. Alarm management is an important part of AIOps: it connects to monitoring tools on one side and to ITIL processes and automation platforms on the other, which makes it the central hub of the whole operations monitoring system. The level of alarm management capability has become a key constraint on the SLA (Service-Level Agreement) of IT operations.
To help companies assess their current alarm management more quantitatively and clarify the goals and evolution path of their alarm management platform, we divide alarm management capability into five levels, which together form the "Alarm Management Capability Maturity Model". The levels are ordered by increasing management capability and are progressive: each higher level includes the content of the levels below it.
Table: Alarm Management Capability Maturity Model Classification

Level 1, decentralized alarm management

To cover every aspect of their IT systems as fully as possible, operations teams introduce more and more monitoring tools. Different monitoring tools produce tens of thousands of alarms, each of which must be analyzed, prioritized, and acted on according to a plan. Over time there may be hundreds of thousands or even millions of alarm events to keep track of.
Because alarms are not centrally managed and assigned, alarm information passes between operations staff in a disorderly way, which makes alarm response and handling inefficient. Strictly speaking, this level is still far from mature management.

Level 2, unified alarm management

More and more operations teams have become aware of the high management costs and slow troubleshooting that this disorder causes. According to statistics, more than 20% of operations teams have unified their alarm management, either with an in-house platform or with a third-party one.
Alarms generated by different monitoring tools or systems are fed into a unified management platform, where they can be deduplicated, filtered, and compressed according to certain rules. At this maturity level the boundaries between monitoring tools are broken down: alarms are categorized from a service or scenario perspective, following the functional division of labor within the operations team (for example by business line or by IT infrastructure layer), and are combined with more efficient collaboration tools such as DingTalk, WeCom (WeChat Work), and Slack, which improves troubleshooting efficiency to a certain extent.
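As a rough illustration of rule-based deduplication and compression, the sketch below merges alarms from different tools when they share a key and arrive close together in time. The `Alarm` fields, the choice of `(host, check)` as the deduplication key, and the five-minute window are all assumptions made for the example, not rules prescribed by any platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str       # monitoring tool that raised the alarm
    host: str
    check: str        # e.g. "cpu_usage"
    timestamp: float  # seconds


def dedup_key(alarm):
    # Hypothetical rule: the same host and check is the same incident,
    # regardless of which monitoring tool reported it.
    return (alarm.host, alarm.check)


def compress(alarms, window_seconds=300.0):
    """Merge alarms that share a key and arrive within `window_seconds`
    of the previous alarm in the same group."""
    groups = []
    latest = {}  # key -> most recent group for that key
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = dedup_key(alarm)
        group = latest.get(key)
        if group and alarm.timestamp - group["last_seen"] <= window_seconds:
            group["count"] += 1
            group["last_seen"] = alarm.timestamp
            group["sources"].add(alarm.source)
        else:
            group = {"key": key, "first_seen": alarm.timestamp,
                     "last_seen": alarm.timestamp, "count": 1,
                     "sources": {alarm.source}}
            latest[key] = group
            groups.append(group)
    return groups
```

Fixed rules of this kind are simple to reason about, but they only catch the repetitions their author anticipated.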

Level 3, intelligent alarm management

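Business changes, and monitoring requirements change with it, so the problems caused by rigid deduplication rules are self-evident. Statistical analysis of large volumes of data shows that fewer than 40% of alarms can be compressed by rules.
With the continued development of artificial intelligence, and in particular the maturing of NLP (Natural Language Processing), classification, clustering, and pattern-discovery algorithms for textual data such as alarms have become the main means of suppressing alarm storms and improving alarm effectiveness. Similar and related alarms in massive data sets can be aggregated using time correlation, text similarity, fault attribution graphs, the CMDB (Configuration Management Database), and other means. Important information in alarms, such as anomalies and novelties, is flagged using time entropy and content entropy: the less frequent, the less regular, and the more severe an alarm is, the more attention it deserves, and the higher the entropy value, the more important the information. Intelligent alarm management greatly reduces the number of alarms to be handled and improves the efficiency of alarm and fault analysis.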
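A minimal sketch of two of these ideas follows, under stated assumptions. Aggregation here uses token-level Jaccard similarity within a time window, and "content entropy" is approximated by the self-information of an alarm template given how often it has been seen before. The article does not define its entropy measures, so the `surprisal` function, its add-one smoothing, and all thresholds are illustrative choices only.

```python
import math
import re
from collections import Counter


def tokens(message):
    # Mask digits so "disk 91% full" and "disk 97% full" share one template.
    return set(re.sub(r"\d+", "<n>", message.lower()).split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def aggregate(alarms, window=120.0, min_similarity=0.6):
    """Greedy clustering of (timestamp, message) pairs: an alarm joins the
    first cluster that is recent enough and textually similar enough."""
    clusters = []
    for ts, msg in sorted(alarms):
        toks = tokens(msg)
        for cluster in clusters:
            recent = ts - cluster["last_seen"] <= window
            if recent and jaccard(toks, cluster["tokens"]) >= min_similarity:
                cluster["members"].append((ts, msg))
                cluster["last_seen"] = ts
                break
        else:
            clusters.append({"tokens": toks, "last_seen": ts,
                             "members": [(ts, msg)]})
    return clusters


def surprisal(template, history):
    """Self-information in bits of seeing `template`, given a Counter of past
    template counts, with add-one smoothing so unseen templates score highest."""
    total = sum(history.values())
    p = (history[template] + 1) / (total + len(history) + 1)
    return -math.log2(p)
```

Under this scoring, a template never seen before ranks above a rare one, which in turn ranks above a template that fires constantly, matching the intuition that infrequent alarms deserve more attention.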

Level 4, root-cause alarm localization

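Root-cause localization has always been the jewel in the crown of alarm management. Because alarms propagate and are multifaceted, quickly locating the root cause amid a mass of intricate information is a huge challenge for every operations team.
Work on root-cause localization falls broadly into three directions. The first performs root-cause analysis on dynamically discovered system call chains and hosting relationships, combined with time correlation. The second uses the CMDB to build a set of (configuration item, relationship) pairs that reflects the system environment in real time, and locates the root cause from how alarms project onto it. The third builds a library of entities, attributes, and relationships covering the whole domain of IT operations management, then applies knowledge-graph algorithms to obtain the root-cause alarm. Whichever approach is taken, it must rest on in-depth study and understanding of the IT system architecture before it can truly tell real from false and see through to the root cause.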
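The second direction can be sketched in a few lines, assuming the CMDB relationships are available as a simple "depends on" mapping between configuration items (CIs). The heuristic used here, that an alarmed CI is a root-cause candidate when none of its direct or transitive dependencies is also alarmed, is one simplified reading of "projecting" alarms onto the CMDB, not the method of any specific product. The function and variable names are hypothetical.

```python
def root_cause_candidates(depends_on, alarmed):
    """Return the alarmed CIs that have no alarmed CI among their
    direct or transitive dependencies.

    depends_on: dict mapping a CI name to the list of CIs it depends on
    alarmed:    set of CI names that currently have active alarms
    """
    def has_alarmed_dependency(ci):
        seen = {ci}  # guards against cycles in the relationship data
        stack = list(depends_on.get(ci, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alarmed:
                return True
            stack.extend(depends_on.get(dep, []))
        return False

    return {ci for ci in alarmed if not has_alarmed_dependency(ci)}


# Made-up topology: web front end -> order service -> database and cache.
depends_on = {
    "checkout-web": ["order-service"],
    "order-service": ["mysql-primary", "redis-cache"],
    "mysql-primary": ["storage-array"],
}
alarmed = {"checkout-web", "order-service", "mysql-primary"}
print(root_cause_candidates(depends_on, alarmed))  # {'mysql-primary'}
```

A real system would also weigh time correlation and alarm severity; the graph walk alone only narrows the set of suspects.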

Level 5, alarm self-healing

Alarm self-healing is a complete automated fault-handling process. It connects the monitoring tools, alarm platform, task scheduling platform, CMDB, ITIL, and other related systems, and runs from receiving the alarm through root-cause localization, rule matching, script execution, fault recovery, and manual confirmation to final alarm recovery, achieving true alarm lifecycle management.
Besides the technical difficulty of root-cause alarm localization at Level 4, another key to the whole self-healing process is building a knowledge base of root-cause fault alarms. It is the accumulation and distillation of day-to-day operations experience, and it is also the basis of the recovery plans. Yet this is precisely a weakness of many companies: much fault-handling experience exists only in the heads of operations staff, and everyday troubleshooting and recovery rely largely on individual ability. When operations staff move on, this most valuable asset is lost with them, so a recurring fault has to be analyzed all over again and recovery time is lengthened unnecessarily.
Alarm self-healing helps operations teams identify the cause of a problem at the first opportunity and fix faults quickly. It also helps them accumulate handling experience and prevent potential risks, ultimately forming a closed-loop operations management system.
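The self-healing flow described above (root-cause localization, rule matching, script execution, manual confirmation, alarm recovery) can be outlined as a small pipeline. This is only a sketch: the function `self_heal`, the shape of the knowledge base (a mapping from a root-cause label to a remediation callable), and the returned status strings are hypothetical, and a real deployment would call out to the scheduling and ITIL systems instead of plain Python callables.

```python
def self_heal(alarm, locate_root_cause, knowledge_base, confirm):
    """Run one alarm through a simplified self-healing pipeline.

    locate_root_cause: callable(alarm) -> root-cause label
    knowledge_base:    dict mapping a root-cause label to a remediation
                       callable(alarm) -> bool (True if the script succeeded)
    confirm:           callable(alarm) -> bool, the manual confirmation step
    Returns a status string describing where the alarm ended up.
    """
    root_cause = locate_root_cause(alarm)

    # Rule matching: look up recorded experience for this root cause.
    script = knowledge_base.get(root_cause)
    if script is None:
        return "escalated: no matching rule"

    # Script execution and fault recovery.
    if not script(alarm):
        return "escalated: script failed"

    # Manual confirmation before the alarm is closed.
    if not confirm(alarm):
        return "escalated: recovery not confirmed"

    return "closed"
```

Every escalation path is a case where a human has to step in, and each one is an opportunity to record a new entry in the knowledge base so that the same fault does not need to be analyzed from scratch next time.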

More and more companies are now exploring alarm management and have achieved some success in suppressing alarm storms. The Ruixiang Cloud (睿象云) intelligent alarm platform is likewise helping operations teams in different industries solve the problems of centralized and intelligent alarm management. The road of operations is long and hard, and continuous alarm improvement cannot be achieved overnight, but we believe that as experience accumulates and technology develops, alarm management will advance by leaps and bounds. We also hope that by discussing and practicing the Alarm Management Capability Maturity Model, we can move together toward the ultimate goal of unattended operations.


Origin blog.51cto.com/14429589/2429078