DevOps operation and maintenance system: monitoring and management

ITIL 4 monitoring management

As DevOps has spread, automated operations and maintenance have received more and more attention, and monitoring-based early warning and self-healing are becoming increasingly popular. Two books regarded as DevOps textbooks, "The DevOps Handbook" and "Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation", both cover monitoring management and its implementation. Monitoring is in fact not a new concept; the operations community has long been exploring it in both theory and tooling. Monitoring management was not addressed in ITIL V2, but ITIL V3 introduced it as an operational activity in "Service Operation" and described the closely related event management as an independent process. In ITIL 4, monitoring management and event management together form a single service management practice: the "monitoring and event management" practice.

How is monitoring explained in ITIL 4?
Many people are familiar with various monitoring tools but cannot summarize and explain the activities of monitoring management at the process level. For that, we can look at how ITIL 4 explains it.

1. Monitoring and event management are inseparable. Note that "event" here does not mean "fault" (an incident). An event is defined as:

Event: Any state change that is significant to the management of a service or other configuration item (CI).

ITIL 4 describes "monitoring and event management" as a dedicated practice. Its purpose is to systematically observe services and service components, and to record and report the changes of state identified as events. The practice identifies and prioritizes infrastructure, service, business process, and information security events, and establishes the appropriate response to them, including responses to conditions that could lead to potential faults or incidents.

The monitoring part focuses on services and configuration items (CIs): it detects potentially significant conditions, tracks and records the status of services and CIs, and provides this information to the relevant parties. The event management part focuses on the monitored state changes that the organization has defined as events: it determines their significance and identifies and initiates the correct response to them. Information about events is also recorded, stored, and provided to the relevant parties. Simply put, monitoring produces data and information, while event management consumes that data and information and develops the corresponding response plans.

2. The main process of monitoring and event management:

The monitoring and event management practice consists of three processes:

●Monitoring planning process: Adding items to monitoring, defining the priority of the monitored items, selecting the characteristics to be monitored, determining the metrics and thresholds used for event classification, and matching events to action plans and responsible teams.

●Event handling process

●Monitoring and event management review: A review process, either scheduled or triggered, that covers post-event analysis and updates to filtering and correlation rules and to the service "health model", and that improves automation and monitoring operability.

See the figure below for specific activities:
[Figure: activities of the monitoring and event management processes]
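The outputs of the monitoring planning process described above can be pictured as plain data. The following is only a minimal sketch, not anything prescribed by ITIL 4: the class `MonitoringItem`, its field names, the CI names, the team names, and all threshold values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringItem:
    """One planned monitoring item (all fields are illustrative assumptions)."""
    ci: str              # configuration item under observation
    metric: str          # characteristic selected for monitoring
    priority: int        # 1 = highest priority
    warning_at: float    # threshold at which a warning event is raised
    exception_at: float  # threshold at which an exception event is raised
    team: str            # team responsible for the response
    action_plan: str     # name of the agreed action plan


# Hypothetical plan with two monitored items.
PLAN = [
    MonitoringItem("core-switch-01", "collision_rate_increase_percent",
                   2, 15.0, 30.0, "network-ops", "inspect-segment"),
    MonitoringItem("app-server-01", "memory_percent",
                   1, 65.0, 75.0, "platform-ops", "scale-or-restart"),
]


def items_by_priority(plan):
    """Return the planned items ordered from highest to lowest priority."""
    return sorted(plan, key=lambda item: item.priority)


def validate(plan):
    """Return a list of human-readable problems found in the plan."""
    problems = []
    for item in plan:
        if item.priority < 1:
            problems.append(f"{item.ci}: priority must be 1 or greater")
        if item.warning_at >= item.exception_at:
            problems.append(f"{item.ci}: warning threshold must be below exception threshold")
    return problems


if __name__ == "__main__":
    for item in items_by_priority(PLAN):
        print(item.priority, item.ci, item.metric, item.team)
    print("problems:", validate(PLAN))
```

Keeping the plan as data makes it easy to review and update during the monitoring and event management review.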

3. Classification of output information for monitoring:

Note that monitoring is necessary for event management, but not every monitoring result leads to the detection of an event. Thresholds and other criteria determine which state changes are treated as events. Note also that not all events have the same significance or require the same response, so we need to define classification criteria for the types of events that occur. Typical categories, in order of increasing significance, are informational events, warning events, and abnormal (exception) events.

Informational: Events that require no action and do not represent an abnormal condition. They are generally used to check the status of a device or service, or to confirm that an activity or task has completed. For example: a device has successfully connected to the network, or a transaction has completed successfully.

Warning: Events raised when a service or device is approaching a set threshold. They are intended to notify the relevant people, processes, or tools so that the situation can be checked and measures taken to prevent an abnormal condition. For example: a server's memory utilization is at 65% and still rising, and at 75% the response time would become unacceptably long and violate the OLA; or the collision rate on the network has increased by 15% in the past hour.

Abnormal (exception): The service or device is currently operating abnormally, in violation of an OLA or SLA. Note that an exception does not always show up as a fault. For example, an unauthorized device discovered on the network is an exception. Depending on the organization's incident management and change management processes, such exceptions can be handled as incidents or as changes.
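To make the role of thresholds concrete, here is a minimal sketch of threshold-based classification. It assumes, for illustration only, that a memory reading of 65% raises a warning and 75% raises an exception (loosely based on the memory example above); the names `EventType` and `classify_reading` are hypothetical. Readings in the normal range produce no event at all, and informational events are assumed to come from status confirmations rather than from thresholds, so this function never returns one.

```python
from enum import IntEnum
from typing import Optional


class EventType(IntEnum):
    """Event categories in order of increasing significance."""
    INFORMATIONAL = 1
    WARNING = 2
    EXCEPTION = 3


def classify_reading(value: float, warning_at: float,
                     exception_at: float) -> Optional[EventType]:
    """Classify one metric reading against its thresholds.

    Returns None when the reading is in the normal range, because
    not every monitoring result is an event.
    """
    if value >= exception_at:
        return EventType.EXCEPTION
    if value >= warning_at:
        return EventType.WARNING
    return None


if __name__ == "__main__":
    # Assumed thresholds: warning at 65%, exception at 75%.
    for reading in (40.0, 68.0, 80.0):
        result = classify_reading(reading, warning_at=65.0, exception_at=75.0)
        print(reading, result.name if result else "no event")
```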

Events need to be matched, in a predefined order, against a set of criteria and rules, also called business rules, to determine the level and type of business impact. The business rules also determine the triggers and the response. Responses can include logging the event, automated responses, alerts and manual intervention, and raising incidents, problems, or changes. These responses also form the interfaces with other practices (processes).
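Matching events against business rules in a predefined order can be sketched as a first-match rule list. This is only an illustration under stated assumptions: the `Event` and `Rule` classes, the rule names, and the response labels are hypothetical and are not taken from ITIL 4 or from any monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Event:
    source: str    # CI or service that produced the event
    category: str  # "informational", "warning" or "exception"
    message: str


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Event], bool]  # predicate deciding whether the rule applies
    response: str                     # label of the response to trigger


# Rules are evaluated top to bottom; the most specific rule comes first.
RULES = [
    Rule("unauthorized-device",
         lambda e: e.category == "exception" and "unauthorized" in e.message,
         "raise-incident-and-alert"),
    Rule("any-exception", lambda e: e.category == "exception", "raise-incident"),
    Rule("any-warning", lambda e: e.category == "warning", "notify-team"),
    Rule("default", lambda e: True, "log-only"),
]


def select_response(event: Event, rules: Sequence[Rule] = RULES) -> str:
    """Return the response of the first rule that matches the event."""
    for rule in rules:
        if rule.matches(event):
            return rule.response
    return "log-only"  # fallback when a rule list has no catch-all rule


if __name__ == "__main__":
    event = Event("core-switch-01", "exception", "unauthorized device found")
    print(select_response(event))
```

Because the first match wins, the order of the rules is itself part of the business logic, which is why the more specific rule is placed ahead of the general ones.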

4. Interface with other practices:

Table 2.1 shows the practices whose activities are closely related to monitoring and event management. Remember that ITIL practices are simply sets of tools to be used in the context of value streams and should be combined as the situation requires.
[Table 2.1: practices that interface with monitoring and event management]


Implementation of monitoring management

Although ITIL 4 explains the framework of monitoring management, it does not prescribe any concrete tools or implementation methods, which has always been the ITIL style. The monitoring tools I have worked with include Zabbix, Nagios, and ELK + Grafana. There are already many articles about these tools on the Internet, so I will not repeat them here.

Origin blog.51cto.com/yazi0127/2550306