[Architecture] Common technical points: monitoring and alarming

Guide: As a project manager, it is worth collecting common architectural technical points, understanding them, and knowing how they apply to concrete scenarios. Technology must serve the business; it is the combination of technology and business that brings out the value of technology.

Table of contents

1. Service Monitoring

2. Full-Link Monitoring

2.1 Service Dial Testing

2.2 Node Detection

2.3 Alarm Filtering

2.4 Alarm Deduplication

2.5 Alarm Suppression

2.6 Alarm Recovery

2.7 Alarm Merging

2.8 Alarm Convergence

2.9 Fault Self-Healing


1. Service Monitoring

The main purpose of service monitoring is to discover, accurately and quickly, that a service has a problem or is about to have one, so as to limit the scope of impact. Service monitoring uses many means, which can be divided into the following levels:

  • System layer (CPU, network status, IO, machine load, etc.)

  • Application layer (process status, error logs, throughput, etc.)

  • Business layer (service/interface error code, response time)

  • User layer (user behavior, public-opinion monitoring, front-end event tracking, etc.)

In the field of operations management, monitoring covers components along the whole chain: network -> device -> system -> application -> component.
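As a small illustration of the system layer above, the sketch below collects a few host metrics (load averages and disk usage) using only the Python standard library; the function name and the returned keys are hypothetical, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil

def collect_system_metrics(path="/"):
    """Collect a few system-layer metrics using only the standard library."""
    load1, load5, load15 = os.getloadavg()   # machine load averages (1/5/15 min)
    disk = shutil.disk_usage(path)           # capacity check for the given mount
    return {
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_pct": round(disk.used / disk.total * 100, 1),
    }

if __name__ == "__main__":
    print(collect_system_metrics())
```

A real agent would ship these samples to a time-series store on a fixed interval rather than printing them.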


2. Full-Link Monitoring

2.1 Service Dial Testing

Service dial testing is a monitoring method for detecting service (application) availability. Dial-test nodes periodically probe the target service, measuring mainly availability and response time. There are usually multiple dial-test nodes in different locations.

By simulating user logins/queries, service dial testing turns passive response to complaints into active discovery. Commonly supported dial-test protocols include HTTP (including HTTPS, with GET and POST methods), TCP, and UDP.
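A single HTTP dial-test probe can be sketched with the standard library as below; the function name and result keys are illustrative, and a real dial-test system would run this periodically from several geographically distributed nodes.

```python
import time
import urllib.request
import urllib.error

def http_probe(url, timeout=5.0):
    """One HTTP dial-test probe: measure availability and response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400       # treat 2xx/3xx as available
    except (urllib.error.URLError, OSError):
        ok = False                               # DNS failure, timeout, refusal
    return {
        "url": url,
        "available": ok,
        "response_ms": round((time.monotonic() - start) * 1000, 1),
    }
```

POST probes and simulated logins would follow the same shape, adding a request body and credentials.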


2.2 Node Detection

Node detection is a monitoring method for discovering and tracking the availability and smoothness of the network between nodes in different machine rooms (data centers). It is mainly measured by response time, packet loss rate, and hop count. The detection method is usually ping, mtr, or another proprietary protocol.
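The two core metrics above can be derived from one round of probes as follows; this is a minimal sketch where the probe transport (ping, mtr, etc.) is assumed to already exist, and a lost packet is represented as `None` in the input list.

```python
def summarize_probes(rtts_ms):
    """Summarize one probe round between two nodes; None means a lost packet."""
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    loss_pct = (sent - len(received)) / sent * 100
    avg_rtt = sum(received) / len(received) if received else None
    return {"loss_pct": round(loss_pct, 1), "avg_rtt_ms": avg_rtt}
```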


2.3 Alarm Filtering

Filter out predictable alarms so that they never enter alarm statistics, such as HTTP 500 errors caused by a small number of crawler visits, or the custom exception messages of business systems.

2.4 Alarm Deduplication

Once an alarm has been sent to the person in charge, the same alarm is not sent again until it recovers.
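Deduplication can be sketched as a set of currently active alarm keys; the class and method names are illustrative, and the key would typically be a fingerprint such as `metric@host`.

```python
class AlarmDeduplicator:
    """Notify once per alarm key; duplicates are dropped until the alarm recovers."""

    def __init__(self):
        self.active = set()

    def should_notify(self, key):
        if key in self.active:
            return False        # same alarm already sent, suppress the duplicate
        self.active.add(key)
        return True

    def recover(self, key):
        self.active.discard(key)   # after recovery, the next firing notifies again
```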


2.5 Alarm Suppression

To reduce the interference caused by system jitter, suppression is also needed. For example, a momentary high load on a server may be normal; only high load that persists for a period of time needs attention.
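A common way to express "only alert on sustained high load" is to require the threshold to be exceeded for N consecutive samples; this is a minimal sketch with hypothetical names.

```python
from collections import deque

class SustainedThreshold:
    """Fire only when the metric exceeds the threshold for `window` consecutive samples."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)   # rolling window of recent samples

    def observe(self, value):
        self.samples.append(value)
        # Fire only once the window is full and every sample is above threshold.
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))
```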

What suppression prevents: without it, troubleshooting and handling problems take much more time, which greatly reduces operations efficiency; and because the root cause cannot be identified at the first moment, diagnosis is delayed, which often introduces potential risks to business operations.


2.6 Alarm Recovery

Developers and operations staff need to receive not only alarm notifications, but also notifications that the fault has been eliminated and the alarm has returned to normal.
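Pairing each alert with a recovery notice is naturally a small state machine over each alarm key; the class and message formats below are illustrative.

```python
class AlarmStateMachine:
    """Emit a notification when an alarm fires and a matching one when it recovers."""

    def __init__(self):
        self.firing = set()

    def update(self, key, is_bad):
        if is_bad and key not in self.firing:
            self.firing.add(key)
            return f"ALERT: {key}"        # fault detected, notify
        if not is_bad and key in self.firing:
            self.firing.remove(key)
            return f"RECOVERED: {key}"    # fault eliminated, notify recovery
        return None                        # no state change, stay quiet
```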


2.7 Alarm Merging

Merge multiple identical alarms generated at the same time. For example, if several sub-services in a microservice cluster report high load at the same moment, they should be merged into a single alarm.
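The merge step can be sketched as grouping alarms by name and time window; the field names (`name`, `window`, `target`) are hypothetical.

```python
from collections import defaultdict

def merge_alarms(alarms):
    """Merge alarms with the same name in the same time window into one alarm."""
    grouped = defaultdict(list)
    for a in alarms:
        grouped[(a["name"], a["window"])].append(a["target"])
    return [
        {"name": name, "window": window, "targets": targets, "count": len(targets)}
        for (name, window), targets in grouped.items()
    ]
```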


2.8 Alarm Convergence

When one alarm fires, it is often accompanied by others. In that case an alarm should be generated only for the root cause, while the others converge into sub-alarms and are sent together in a single notification. For example, a CPU load alarm on a cloud server is often accompanied by availability alarms for all the systems it hosts.
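Given a dependency map (which service runs on which host), convergence can be sketched as attaching dependent alarms as children of the root-cause alarm; the function and field names are illustrative, and real root-cause analysis is far more involved.

```python
def converge(alarms, depends_on):
    """Group alarms under their root cause using a service -> host dependency map."""
    # Alarms on targets that depend on nothing are treated as root causes.
    roots = {a["target"]: dict(a, children=[]) for a in alarms
             if a["target"] not in depends_on}
    orphans = []
    for a in alarms:
        host = depends_on.get(a["target"])
        if host in roots:
            roots[host]["children"].append(a)   # converge into a sub-alarm
        elif a["target"] not in roots:
            orphans.append(a)                   # no known root cause: keep as-is
    return list(roots.values()) + orphans
```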


2.9 Fault Self-Healing

Detect alarms in real time, pre-diagnose and analyze them, recover from faults automatically, and integrate with surrounding systems to close the loop on the whole process.

Alarm self-healing is a complete automated fault-handling workflow. By connecting monitoring tools, the alarm platform, the task-scheduling platform, the CMDB, ITIL, and other related systems, it implements alarm reception, root-cause location, rule matching, script execution, fault recovery, manual confirmation, and finally alarm recovery, truly achieving full lifecycle management of alarms.
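The rule-matching, script-execution, and verification steps of that workflow can be sketched as below; the rule structure and return values are hypothetical, and in a real platform `action` would call the task-scheduling system and failure would escalate to a human for manual confirmation.

```python
def self_heal(alarm, rules):
    """Match an alarm against healing rules, run the bound action, verify recovery."""
    for rule in rules:
        if rule["match"](alarm):
            rule["action"](alarm)          # e.g. restart a service, clean a disk
            if rule["verify"](alarm):      # re-check the metric that alarmed
                return "recovered"
            return "escalate_to_human"     # healing failed: manual confirmation
    return "no_rule_matched"
```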


Extension: Fault classification

  • Intermittent: the fault self-heals quickly after it occurs.

  • Repeated: one or more indicators of a single object keep alarming.

  • Ranged: a fault occurs across an area or cluster, and multiple objects within that range alarm simultaneously within a short period.


Expansion: a coping idea worth learning from: one company's product, a one-stop alarm lifecycle management platform, provides a closed AIOps loop from monitoring and anomaly-detection alarms through alarm compression to root-cause analysis.

Origin blog.csdn.net/weixin_43800786/article/details/130798126