How to build a unified alarm management system for multiple alarm sources?

This article introduces best practices for unified alarm management to help enterprises deal with the challenges that heterogeneous monitoring systems bring.

Background Information

In the cloud native era, enterprise IT infrastructure keeps growing, and more and more systems and services are deployed in cloud environments. To monitor these complex IT environments, enterprises usually adopt heterogeneous monitoring systems such as Prometheus, Grafana, and Zabbix to obtain more comprehensive monitoring data and better understand the health and performance of their IT infrastructure.

However, heterogeneous monitoring also brings problems, the most notable of which is scattered alarm information. Because different monitoring systems generate different alarm information, that information ends up dispersed across systems, making it difficult for enterprises to get a full picture of the alarm status of their IT systems. This slows down alert response and increases the complexity and workload of manual management.

To solve these problems, enterprises need a more unified and centralized alarm management solution that ensures alarm information reaches the right people in time, so that they can quickly take the necessary actions against potential problems.

Pain points of alarm management

Scenario 1: After the enterprise migrates to the cloud, alarms from cloud products are inconsistent

[Figure: typical cloud-native deployment architecture on Alibaba Cloud]

A typical cloud-native business application deployment architecture usually involves products such as ACK, ECS, and RDS: the application is deployed through Kubernetes on Alibaba Cloud ECS instances and accesses RDS in the cloud. In this architecture, the following monitoring products are usually used to monitor the system.

  • CloudMonitor monitors the Alibaba Cloud infrastructure (ECS and RDS) and sends an alarm when resources are abnormal.
  • Prometheus monitors Kubernetes and the Pods deployed on it, and sends an alarm when Kubernetes is abnormal.
  • ARMS monitors the applications deployed on Kubernetes, including their call chains, and sends an alarm when an application is abnormal.
  • SLS monitors the logs generated by the applications and sends an alarm when logs are abnormal.

In such a scenario, because multiple cloud products monitor different parts of the system, users have to repeatedly configure contacts, notification methods, on-duty schedules, and other operation and maintenance settings in each product. Moreover, alarms from different systems cannot be organically combined: when a problem occurs, related alarms in different alarm systems cannot be quickly correlated.

Scenario 2: Under multi-cloud and hybrid cloud architectures, alarms from heterogeneous monitoring systems are not unified

[Figure: monitoring across multi-cloud and hybrid cloud environments]

When an enterprise's applications are deployed in a multi-cloud or hybrid cloud environment, the alarms generated by its monitoring systems can become even more scattered and complex, which poses great challenges for operation and maintenance. Because cloud platforms and private cloud architectures differ, the ways monitoring data is collected and processed may also differ. The alarm information generated by different monitoring systems therefore varies as well, bringing a series of problems.

First, alarm information from different monitoring systems is scattered in different places, so operation and maintenance personnel must spend extra time and energy processing it. Second, it is difficult to manage and analyze alarms from different systems in a unified way, which makes diagnosing and solving problems harder. In addition, handling alert information from different systems becomes more complex because there may be duplicate or conflicting alerts.

Scenario 3: Self-developed monitoring systems and custom event alarm access

During application development and operation, as the system grows in scale and complexity, glue code accumulates in every corner. Although this code is an important link between modules and systems, it is scattered in many places, so when a problem occurs it is hard to discover and handle immediately. This makes it difficult for enterprises to guarantee high availability and stability. How to flexibly and cost-effectively ingest the alarms generated by this code has therefore become one of the pain points of enterprise application operation and maintenance.

Unified alarm management

When building a unified alarm management platform, different monitoring systems define and process alarms differently, and the following problems often arise:

  • Different systems generate alarms in different formats, so integration costs are high.
  • After alarms from different systems are ingested, the inconsistent formats make it hard to unify the processing logic.
  • Different alarm systems define alarm levels differently.
  • Different alarm systems handle automatic alarm recovery differently: some support it, others do not.

The integration, event processing flow, and notification policy capabilities of ARMS alarm management [1] are designed specifically for unified alarm management and solve many of the problems encountered in the process.

How does ARMS alarm management access alarms in different formats?

A traditional alarm usually contains the fields shown below. Such a structured alarm usually fits only a single alarm source; when data from multiple alarm sources is aggregated, the data structures tend to conflict. ARMS therefore stores alarms as semi-structured data.

[Figure: Alibaba Cloud CloudMonitor alarm data format]

[Figure: Zabbix alarm data format]

[Figure: Nagios alarm data format]

Semi-structured alarm data structure

[
  {
    "labels": {
      "alertname": "<requiredAlertName>",
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>",
      ...
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
  • labels: alarm metadata. A set of labels uniquely identifies an event; all events with the same labels are the same event, and repeated reports are merged. Example: alertname: the alarm name.
  • annotations: additional descriptions of the alarm event; annotations are not part of the metadata. Example: message: the alarm content. Occurrences of the same event at different points in time carry the same labels, but their annotations can differ. For example, the alarm content annotation may read: "The CPU usage of the host i-12b3ac3*** has been greater than 75% for three minutes, and the current value is 82%".
  • startsAt: the start time of the alarm event.
  • endsAt: the end time of the alarm event.
  • generatorURL: the URL address of the alarm event.

As the code above shows, ARMS follows the open source Prometheus alarm definition [2] and uses a semi-structured data structure to describe alarms. Because alarms are described by highly extensible key-value pairs, the alarm content can be extended very flexibly to ingest alarms generated by different data sources.
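
For illustration, the following minimal Python sketch builds an alarm in this semi-structured format and reports it over HTTP. The webhook URL, label names, and timestamps are illustrative placeholders, not real ARMS endpoints.

# Minimal sketch: build a semi-structured alarm and POST it to an
# integration webhook. The URL is a hypothetical placeholder.
import json
import urllib.request

alert = [{
    "labels": {                       # metadata: identifies the event
        "alertname": "HostHighCpuUsage",
        "severity": "critical",
        "hostname": "i-12b3ac3***",
    },
    "annotations": {                  # extra description, not identity
        "message": "CPU usage of host i-12b3ac3*** has been greater "
                   "than 75% for three minutes; the current value is 82%.",
    },
    "startsAt": "2023-06-01T08:00:00Z",   # RFC 3339 timestamps
    "endsAt": "0001-01-01T00:00:00Z",     # zero value: not yet recovered
    "generatorURL": "http://example.com/alerts/1",
}]

req = urllib.request.Request(
    "https://example.com/your-integration-webhook",  # placeholder URL
    data=json.dumps(alert).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)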

Custom alarm access for any JSON format

ARMS alarms can ingest data in any JSON format (custom integration [3]): as long as the alarm data is valid JSON, it can be accessed. As shown in the figure below, custom alarm access first uploads the alarm's JSON data to the ARMS alarm center, and then maps the key information in the alarm content to the ARMS alarm data structure by editing the field mappings on the page.

[Figure: custom integration field mapping]

ARMS defines key fields such as alertname; users can configure further fields by adding extended fields in the integration. All extended fields can be used in the subsequent alarm processing logic. The figure below shows an example that maps the hostname field in the original alarm message to the extended hostname field, and the hostip field to the extended hostip field.

[Figure: mapping hostname and hostip to extended fields]
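
Conceptually, the mapping configured in the console behaves like the following Python sketch. The raw field names (title, hostname, hostip) follow the example above; the transform itself is performed by ARMS, not by user code.

# Sketch of what the field mapping does conceptually: key fields from the
# raw JSON become ARMS key fields, extra fields become extended fields.
FIELD_MAPPING = {
    "title": "alertname",   # raw field -> ARMS key field
    "hostname": "hostname", # raw field -> extended field
    "hostip": "hostip",
}

def map_alarm(raw: dict) -> dict:
    mapped = {"labels": {}, "annotations": {"message": raw.get("message", "")}}
    for src, dst in FIELD_MAPPING.items():
        if src in raw:
            mapped["labels"][dst] = raw[src]
    return mapped

raw_alarm = {"title": "DiskFull", "hostname": "host-01",
             "hostip": "10.0.0.1", "message": "Disk usage is above 90%."}
print(map_alarm(raw_alarm))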

Quick access for alarms from commonly used monitoring tools

By default, ARMS can ingest alarms from a variety of monitoring systems on and off the cloud. See the integration overview [4] for quick access.

[Figure: built-in alarm integrations]

How does ARMS alarm management unify alarm levels?

In ARMS, alarms are divided into four levels: P1, P2, P3, and P4. By configuring a mapping table, the many different kinds of levels are normalized to these four. As shown in the figure below, three differently named alarm levels, L1, Critical, and serious, are all mapped to P1, so that the different level definitions of different systems can be unified.

[Figure: mapping L1, Critical, and serious to P1]
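
The mapping table amounts to a simple lookup. A minimal sketch, where any level names beyond L1/Critical/serious are illustrative assumptions:

# Sketch: normalize heterogeneous severity names to P1-P4. The real
# mapping table is maintained in the ARMS integration configuration.
SEVERITY_MAPPING = {
    "l1": "P1", "critical": "P1", "serious": "P1",
    "l2": "P2", "error": "P2",
    "l3": "P3", "warning": "P3",
    "l4": "P4", "info": "P4",
}

def normalize_severity(raw_level: str) -> str:
    # unknown levels default to the lowest severity
    return SEVERITY_MAPPING.get(raw_level.strip().lower(), "P4")

assert normalize_severity("Critical") == "P1"
assert normalize_severity("WARNING") == "P3"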

How does ARMS alarm management unify the processing logic for alarms in different formats?

Because ARMS alarms use a semi-structured data structure, labels can be used to unify the alarm processing logic. Usually at least two labels are needed: one label determines who should be notified of the alarm, such as a business label (service, biz); another label determines how the alarm is notified and escalated. As the following table shows, the alarm severity is usually used to define the SLA for alarm handling.

[Table: alarm severity and handling SLA]

ARMS provides two kinds of policies, notification policies and escalation policies, to meet the processing requirements of alarms at different levels. See the notification policy best practices [5] for configuration.
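
Taken together, the two labels drive alarm routing roughly as in this sketch. The contact table and SLA minutes are illustrative assumptions; the real routing is configured through notification and escalation policies, not code.

# Sketch: route an alarm using its business label and its severity label.
CONTACTS = {"order-service": ["alice"], "payment-service": ["bob"]}
RESPONSE_SLA_MINUTES = {"P1": 5, "P2": 15, "P3": 60, "P4": 240}

def route(alarm_labels: dict):
    who = CONTACTS.get(alarm_labels.get("service"), ["default-oncall"])
    sla = RESPONSE_SLA_MINUTES[alarm_labels.get("severity", "P4")]
    print(f"notify {who}, escalate if unacknowledged after {sla} minutes")

route({"service": "order-service", "severity": "P1"})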

Label Design Principles

When designing business labels for alarm processing, the following principles should be met:

  • Mutual exclusion: avoid using two or more label keys for the same resource. For example, if the label key service already identifies the business, do not use similar label keys such as biz or business.
  • Collective exhaustiveness: all resources must be bound to the planned label key and a corresponding label value. For example, if a company has three businesses and the label key is service, there should be at least three label values representing those three businesses.
  • Limited values: keep only core label values for resources and delete redundant ones. For example, if a company has five businesses in total, there should be labels for exactly those five businesses, to keep management simple.

Besides business labels, other labels can also be defined for alarm management, such as an environment label to separate alarms from the development and test environments. These labels should follow the same design principles, which simplifies alarm management configuration.
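
The principles can even be checked mechanically before labels go into use. A minimal sketch, assuming a company with the five businesses and three environments listed below:

# Sketch: check labels against the three design principles. The allowed
# keys and values are illustrative for a company with five businesses.
ALLOWED = {
    "service": {"order", "payment", "inventory", "logistics", "crm"},
    "env": {"dev", "test", "prod"},
}
FORBIDDEN_SYNONYMS = {"biz", "business"}  # mutual exclusion with "service"

def validate_labels(labels: dict):
    for key, value in labels.items():
        if key in FORBIDDEN_SYNONYMS:
            raise ValueError(f"use 'service' instead of '{key}'")
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"'{value}' is not a planned value for '{key}'")
    if "service" not in labels:
        # collectively exhaustive: every alarm carries a business label
        raise ValueError("missing the 'service' label")

validate_labels({"service": "order", "env": "prod"})  # passes silently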

Labeling alarms through the event processing flow (alarm enrichment)

Once the labels are designed, how do we attach them to alarms from different alarm sources? ARMS alarm management provides a low-code event processing flow [6], so alarms can be labeled (enriched) through drag-and-drop configuration.

Scenario 1: Labeling an alarm after matching a specific condition

A business, xx, uses a self-developed monitoring system. After its alarms are ingested into ARMS alarm management through a custom integration, they all need to be tagged with the business label xx. The event processing flow is configured as follows:

a. Log in to the ARMS console [7], choose Alarm Management in the left navigation pane, and click New Processing Flow.

b. Create an event processing flow in the panel that appears, and edit the trigger condition to match the custom integration named "xx self-developed monitoring system".

[Figure: trigger condition matching the custom integration]

c. Add the Set Service Label action and set "xx" as the value of the service label.

[Figure: setting the service label]
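
The rule configured in steps b and c is logically equivalent to this sketch (the integration field name used for matching is an illustrative assumption):

# Sketch: match on the integration name, then attach a fixed service label.
def tag_alarm(alarm: dict) -> dict:
    if alarm.get("integration") == "xx self-developed monitoring system":
        alarm.setdefault("labels", {})["service"] = "xx"
    return alarm

print(tag_alarm({"integration": "xx self-developed monitoring system"}))
# {'integration': 'xx self-developed monitoring system', 'labels': {'service': 'xx'}}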

Scenario 2: Splitting strings and extracting labels

All hosts in a self-developed alarm system are named in a fixed format, ${env}-${biz}-${app}-${group}-${index}, and the biz field needs to be extracted as a business label. After configuring the correct trigger conditions, use the split content operation to split the hostname by the character '-', and fill the resulting parts into the env, service, app, and group fields in turn.

[Figure: splitting the hostname into fields]
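
The split content operation is equivalent to this sketch:

# Sketch: what the split content operation does with the hostname.
# Naming format: ${env}-${biz}-${app}-${group}-${index}
def split_hostname(hostname: str) -> dict:
    env, biz, app, group, _index = hostname.split("-", 4)
    # biz is written into the service label; the rest become extended fields
    return {"env": env, "service": biz, "app": app, "group": group}

print(split_hostname("prod-order-api-blue-01"))
# {'env': 'prod', 'service': 'order', 'app': 'api', 'group': 'blue'}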

Scenario 3: Enriching alarms by querying Excel tables

An application monitoring platform reports only the application ID when an alarm occurs; the application name, owner, and other information must be associated from an Excel table.

[Figure: sample application CMDB Excel table]

a. Create an Excel data source and upload the app_cmdb.xlsx file.

[Figure: creating the Excel data source]

b. Configure the event processing flow and add a field enrichment operation, selecting the data source created in the previous step. Set the matching field to appId, and fill the other columns of the Excel table into the appName, owner, and ownerPhone extended fields respectively.

[Figure: field enrichment configuration]
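
The field enrichment step is essentially a lookup-table join on appId. In the sketch below, an in-memory dict stands in for the uploaded app_cmdb.xlsx; the row content follows the sample data in the Function Compute code under Additional information.

# Sketch: field enrichment as a lookup-table join. The dict stands in for
# the uploaded app_cmdb.xlsx; keys are appIds, values are the other columns.
APP_CMDB = {
    "b38cdf95-2526-4d7a-9ea9-ffe7b32*****": {
        "appName": "iot-iam", "owner": "Wang Wu", "ownerPhone": "130xxxx1236",
    },
}

def enrich(alarm: dict) -> dict:
    row = APP_CMDB.get(alarm.get("appId"))
    if row:  # matched on appId: copy the remaining columns into extended fields
        alarm.update(row)
    return alarm

print(enrich({"appId": "b38cdf95-2526-4d7a-9ea9-ffe7b32*****",
              "alertname": "AppDown"}))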

Scenario 4: Enriching alarms by calling external services through Serverless (Function Compute)

As in Scenario 3 above, when data missing from the alarm needs to be obtained from an external system such as a CMDB, the alarm can be enriched through an API-type data source.

[Figure: enriching alarms through an API data source]

a. Create a Function Compute application [8] and develop an HTTP service that receives appId as an input parameter and returns appName, owner, ownerPhone, and other parameters. The screenshot below shows sample code only; the full listing appears under Additional information.

[Figure: sample HTTP function code]

b. Create an API-type data source whose URL address points to the function developed in the previous step.

[Figure: creating the API data source]

c. Configure the event processing flow and add a field enrichment operation, selecting the data source created in the previous step. Set the matching field to appId, and fill the fields returned by the API into the appName, owner, and ownerPhone extended fields respectively.
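
From the event processing flow's point of view, the API data source behaves like the following sketch. The endpoint URL is a placeholder for the function's HTTP trigger; the server-side sample code is listed under Additional information below.

# Sketch: enrich an alarm by POSTing its appId to an external HTTP service.
# The URL is a placeholder for the Function Compute HTTP trigger.
import json
import urllib.request

def enrich_via_api(alarm: dict) -> dict:
    req = urllib.request.Request(
        "https://your-fc-endpoint.example.com/enrich",  # placeholder URL
        data=json.dumps({"appId": alarm["appId"]}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        alarm.update(json.loads(resp.read()))  # appName, owner, ownerPhone
    return alarm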

How to configure automatic alarm recovery in ARMS alarm management?

Different monitoring systems handle automatic alarm recovery differently. For example, Prometheus does not send recovery alarms in a dedicated format; the alarm's time fields alone indicate whether the alarm has ended. Alibaba Cloud CloudMonitor [9] merges the recovery status into the alarm level, as shown below.

  • Parameter: triggerLevel
  • Data type: String
  • Meaning: the level at which the alarm is triggered this time. Valid values:
    • CRITICAL: critical
    • WARN: warning
    • INFO: information
    • OK: normal (the alarm has recovered)
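
Handling this format therefore means treating one level value as the recovery signal rather than as a new alarm. A minimal sketch, where the mapping of CloudMonitor levels to P1-P4 is an illustrative assumption:

# Sketch: map CloudMonitor's triggerLevel to an ARMS severity, treating
# the OK level as a recovery signal instead of a new alarm.
LEVEL_TO_SEVERITY = {"CRITICAL": "P1", "WARN": "P3", "INFO": "P4"}

def classify(trigger_level: str) -> dict:
    if trigger_level == "OK":
        return {"recovered": True}
    return {"recovered": False,
            "severity": LEVEL_TO_SEVERITY.get(trigger_level, "P4")}

assert classify("OK") == {"recovered": True}
assert classify("CRITICAL") == {"recovered": False, "severity": "P1"}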

Alarms in different scenarios may also require different recovery logic. For a threshold-type alarm, when the monitored value no longer meets the threshold condition, the alarm is expected to recover immediately. For an important event-type alarm, however, the event occurs only at a moment in time and has no recovery process; operation and maintenance personnel must manually confirm that the impact of the event has been eliminated before the alarm can be recovered.

Scenario 1: For alarms that do not recover on their own, configure an automatic recovery time so that they recover automatically after that time

For event-type alarms, it is usually necessary to manually confirm the event's scope of impact before handling the alarm. Automatic recovery at that point could cause an event that needs handling to go unprocessed. For this situation, the alarm should not recover automatically after it is received, or at least not within a long period, giving the handler enough time to confirm the alarm's impact.

[Figure: automatic recovery after a configured time]

[Figure: configuring the automatic alarm recovery time in an ARMS custom integration]
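
The effect of the automatic recovery time can be expressed as a small sketch; the 24-hour window below is an illustrative value, not a product default:

# Sketch: an alarm that never receives a recovery event is closed
# automatically once the configured window elapses.
from datetime import datetime, timedelta, timezone

AUTO_RECOVERY = timedelta(hours=24)  # illustrative window

def is_recovered(starts_at: datetime, now: datetime) -> bool:
    return now - starts_at >= AUTO_RECOVERY

started = datetime(2023, 6, 1, 8, 0, tzinfo=timezone.utc)
print(is_recovered(started, started + timedelta(hours=25)))  # True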

Scenario 2: Configure the recovery alarm field, and recover the alarm after receiving the recovery event

In an ARMS alarm integration, you can configure an alarm recovery field. When the value of that field in the alarm content meets the condition, the event is treated as a recovery alarm; the corresponding alarm is then found from the contents of the other fields and recovered. The following diagram illustrates active alarm recovery:

[Figure: active alarm recovery]

[Figure: configuring the alarm recovery field in the ARMS console]

Alarm recovery must meet the following two conditions for the corresponding alarm to be recovered correctly.

  • If no deduplication fields are defined, the labels of the alarm and the recovery alarm must be completely identical.
  • If deduplication fields are defined, the deduplication fields of the alarm and the recovery alarm must be completely identical.

Note: when a field (such as status) is configured as the alarm recovery field, do not add that field to the alarm mapping rules. Doing so usually makes the fields of the alarm and the recovery alarm mismatch, so recovery fails.
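
The two matching rules above can be summarized in a short sketch:

# Sketch: a recovery event closes an alarm only when the configured
# deduplication fields match exactly (or all labels, if none are configured).
def matches(alarm_labels: dict, recovery_labels: dict, dedup_fields=None) -> bool:
    if dedup_fields:  # compare only the configured deduplication fields
        return all(alarm_labels.get(f) == recovery_labels.get(f)
                   for f in dedup_fields)
    return alarm_labels == recovery_labels  # all labels must be identical

alarm = {"alertname": "HostDown", "hostname": "host-01", "severity": "P1"}
recovery = {"alertname": "HostDown", "hostname": "host-01"}
print(matches(alarm, recovery))                             # False
print(matches(alarm, recovery, ["alertname", "hostname"]))  # True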

Additional information

Function Compute sample code:

# -*- coding: utf-8 -*-

import json

def handler(environ, start_response):
    # Function Compute HTTP handler (WSGI style): read the appId from the
    # request body and return the application details as JSON.
    body_str = get_request_body(environ)
    app_id = json.loads(body_str).get('appId')
    # The next line is pseudocode: the example queries a CMDB for the
    # application details. The returned app looks like:
    # {"appId": "b38cdf95-2526-4d7a-9ea9-ffe7b32*****", "appName": "iot-iam",
    #  "owner": "Wang Wu", "ownerPhone": "130xxxx1236"}
    app = cmdb.getApp(app_id)
    ret = json.dumps(app)
    status = '200 OK'
    response_headers = [('Content-type', 'application/json')]
    start_response(status, response_headers)
    return [ret.encode('utf-8')]

def get_request_body(environ):
    # Read the raw request body according to the Content-Length header.
    try:
        request_body_size = int(environ.get('CONTENT_LENGTH', 0))
    except ValueError:
        request_body_size = 0
    request_body = environ['wsgi.input'].read(request_body_size)
    return request_body

Related Links:

[1] ARMS alarm management

https://help.aliyun.com/document_detail/214753.htm?spm=a2c4g.2362717.0.0.1890245ddgeRkP#concept-2075853

[2] Prometheus alarm definition

https://prometheus.io/docs/alerting/latest/clients/#sending-alerts

[3] Custom integration

https://help.aliyun.com/document_detail/251850.htm?spm=a2c4g.2362717.0.0.18906bf4Pry1jD#task-2021669

[4] Integration Overview

https://help.aliyun.com/document_detail/260831.htm?spm=a2c4g.2362717.0.0.1890d928BoEXFr#concept-2078267

[5] Notification Policy Best Practices

https://help.aliyun.com/document_detail/456953.htm?spm=a2c4g.2362717.0.0.1890951awN1Sbk#task-2249792

[6] Event processing flow

https://help.aliyun.com/document_detail/311905.htm?spm=a2c4g.2362717.0.0.18901c8dwhrptl#task-2114624

[7] ARMS Console

https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Farms.console.aliyun.com%2F#/home

[8] Function Compute Application

https://help.aliyun.com/document_detail/51783.htm?spm=a2c4g.2362717.0.0.189070368lSswF#multiTask782

[9] Alibaba Cloud CloudMonitor

https://help.aliyun.com/document_detail/60714.htm?spm=a2c4g.2362717.0.0.18904bf99bofq7#task-2151109

At present, the application real-time monitoring service ARMS provides a full-featured 15-day trial, so developers can fully experience its alarm capabilities.

