Best Practices for Using EasyOps Products: Agent Survival Monitoring

The EasyOps platform now has built-in Agent survival monitoring!

The Agent is the core underlying component for automation and monitoring, and its availability directly affects the upper-layer functions that depend on it, so its status deserves close attention. Network fluctuations, Agent upgrades, and machine failures can all cause Agent abnormalities. Users want the person responsible for platform operation and maintenance to be notified of such abnormalities promptly, so that they can notice and handle them in time. In the past, the platform had no built-in monitoring of Agent survival status, and on-site personnel relied on various bypass methods instead. Those methods cannot reliably detect changes in Agent status in real time. The platform now has this function built in, which solves the problem at its source.

The component responsible for managing Agent state (the gateway) exposes its in-memory state data directly as an indicator in the alarm processing pipeline, so that the upper layer can configure alarms on it. In addition, the Agent status is reported on state change (edge-triggered), so detection is very sensitive.
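The edge-triggered reporting described above can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the class name `EdgeTriggeredReporter`, its fields, and the `observe` method are hypothetical, and the hourly periodic report interval is taken from the note in the alarm rule section.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

REPORT_INTERVAL_S = 3600  # periodic report interval in seconds (hourly)


@dataclass
class EdgeTriggeredReporter:
    """Hypothetical model of a component that reports Agent status as an indicator."""

    last_status: Optional[str] = None
    last_report_time: Optional[int] = None
    emitted: List[Tuple[int, str]] = field(default_factory=list)

    def observe(self, now: int, status: str) -> bool:
        """Record the status seen at time `now`; return True if a data point was emitted."""
        changed = status != self.last_status
        due = (
            self.last_report_time is None
            or now - self.last_report_time >= REPORT_INTERVAL_S
        )
        if changed or due:
            # A state change is emitted immediately; otherwise only the
            # periodic report produces a data point.
            self.emitted.append((now, status))
            self.last_status = status
            self.last_report_time = now
            return True
        return False


reporter = EdgeTriggeredReporter()
reporter.observe(0, "normal")      # first observation: emitted
reporter.observe(600, "normal")    # no change and not yet due: skipped
reporter.observe(900, "abnormal")  # state change: emitted immediately
print(reporter.emitted)
```

In this sketch the abnormal data point appears at the moment of the change rather than at the next hourly report, which is the property that makes alarms on this indicator responsive.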

Applicable release version: 6.19.0

1. Description

The Agent is the client-side agent component of the EasyOps platform. It implements functions such as monitoring collection, resource discovery, and tool execution. Because Agent survival is critical to the operation of the system, the EasyOps platform provides built-in monitoring indicators for the Agent.

These indicators are collected by default in the EasyOps platform. No additional collection policy is required; you only need to configure the corresponding alarm rules.

2. Alarm rule configuration

⑴ Create a new alarm rule: First, create an alarm rule that defines the monitoring target range, that is, the range of hosts you want to monitor.

⑵ Set the alarm indicator: In the alarm rule, select "Host Agent Status" as the alarm indicator, and set the threshold to "not equal to normal". This way, an alarm is triggered whenever the Agent state is abnormal.

● Please note: The Agent status indicator is reported hourly, but a status change is also reported immediately (edge-triggered). In other words, when the Agent state changes from "normal" to "abnormal", a new indicator value is reported right away. To keep the alarm accurate and timely, set the trigger condition to one data point. (If you require two data points, the abnormal state must persist for at least 1 hour before the alarm fires, so the alarm delay is too long.) This means that even though the indicator is normally reported once an hour, the system captures a state change as soon as it happens and triggers the corresponding action. This design keeps Agent status monitoring timely and sensitive, so you can rely on it to discover and handle abnormal Agent status quickly.
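The effect of the data point setting can be illustrated with a small sketch. It assumes that the state change produces a data point immediately and that the next data point arrives with the following report one hour later; the function name `first_alarm_time` and the sample timestamps are hypothetical and only serve the illustration.

```python
from typing import List, Optional, Tuple


def first_alarm_time(
    points: List[Tuple[int, str]], required_points: int
) -> Optional[int]:
    """Return the timestamp at which `required_points` consecutive
    non-normal data points have been seen, or None if that never happens."""
    if required_points < 1:
        raise ValueError("required_points must be at least 1")
    streak = 0
    for ts, status in points:
        streak = streak + 1 if status != "normal" else 0
        if streak >= required_points:
            return ts
    return None


# Timestamps in seconds: the Agent becomes abnormal at t=900 (reported
# immediately), and the next report follows one hour later at t=4500.
points = [(0, "normal"), (900, "abnormal"), (4500, "abnormal")]

one_point = first_alarm_time(points, required_points=1)
two_points = first_alarm_time(points, required_points=2)
print(one_point, two_points, two_points - one_point)
```

With one data point the alarm fires at the moment of the change; with two, it waits for the next hourly report, which adds a delay of 3600 seconds in this example.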

⑶ Add rich information to the alarm: You can add more information to the alarm to describe its content and context more clearly.

⑷ Set the alarm template: Set the template for the alarm message so that it contains the key information and is easy to read.

【SLO event alarm】A "{{levelName}}" level alarm was generated at {{time|ts2str:'%Y-%m-%d %H:%M'}}

Alarm resource: {{target}}

Alarm level: {{levelName}}

Alarm message: 『 {{originContent}} 』

Operation manager: {{instance|jsonpath:'$.owner[*].name'|unique|join:','}}

Time the alarm first occurred: {{startTime|ts2str:'%Y-%m-%d %H:%M'}}

Duration since the first alarm: {{duration|duration_format:'zh'}}

Event details: http://<your platform address>/next/events/{{eventId}}/detail

Policy details: http://<your platform address>/next/events/alert-rule/alert-rule/{{ruleId}}/edit
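To make the template filters easier to read, here is a rough Python sketch of what two of the filter chains appear to compute. It is an interpretation based only on the filter names in the template, not the platform's implementation: it assumes that timestamps are Unix seconds, renders them in UTC for reproducibility, and assumes that `unique` keeps the first occurrence of each name. The helper names `ts2str` and `owner_names` and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def ts2str(ts: int, fmt: str) -> str:
    """Format a Unix timestamp (assumed to be in seconds) using a strftime pattern."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)


def owner_names(instance: dict) -> str:
    """Rough equivalent of: instance | jsonpath:'$.owner[*].name' | unique | join:','"""
    names = [o["name"] for o in instance.get("owner", []) if "name" in o]
    unique_names = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
    return ",".join(unique_names)


# Hypothetical sample instance with a duplicated owner entry.
instance = {"owner": [{"name": "alice"}, {"name": "bob"}, {"name": "alice"}]}
print(ts2str(0, "%Y-%m-%d %H:%M"))
print(owner_names(instance))
```

The sample prints the formatted epoch time and the de-duplicated, comma-joined owner names, which is the kind of text that would appear after "Time the alarm first occurred" and "Operation manager" in the rendered message.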

After saving the configuration, you will receive an alarm like the following when the Agent status becomes abnormal (using a DingTalk alarm as an example):

[Image: alarm notification]

[Image: recovery notification]


Origin blog.csdn.net/EasyOps_DevOps/article/details/131729193