Thoughts on system monitoring and alarms

Background

In the last group meeting there was some discussion about the monitoring and alarm system. Some colleagues feel the system now prints too many error logs: every error log triggers an alarm, so we receive a large number of alarms every day. With that much alarm noise, it is easy to overlook the alarms that matter.

Here are some ideas that came up in the discussion:

  1. Be careful about printing error logs during development: print an error log only when the problem affects the user experience
  2. Apply artificial intelligence methods to large amounts of alarm data to fit an alarm noise-reduction algorithm

Of these two ideas, the first places high demands on developers, and it is often hard to judge whether the user experience is affected. The second has no mature solution in the industry, and it does not feel feasible.

My thoughts

As I understand it, the alarms are noisy because our alarm system works at the system level: an RPC call timing out, an interface call returning an error, and so on. These alarms do not reflect the health of the system at the business level, so when we see one we cannot form an overall picture of how healthy the system is.

I think the situation would be much better if we added another layer of monitoring and alarms at the business level. For example, ours is an IM system, so we could monitor the number of messages created per second and compare it both with the same period earlier (year-on-year) and with the immediately preceding period (period-over-period). If message volume drops significantly at some point, we raise an alarm. When we receive such an alarm we can at least be sure that something is wrong with the system, and combined with the system-level alarms we can quickly locate where the problem is.
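The message-volume check described above could be sketched roughly as follows. This is only an illustration: the function names, the 50% threshold, and the choice to require a drop against both baselines are assumptions, not something the discussion settled on.

```python
def drop_ratio(current: float, baseline: float) -> float:
    """Fractional drop of `current` relative to `baseline`.

    Returns 0.0 when there is no drop or when the baseline is not usable.
    """
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def should_alarm(
    current: float,
    previous_period: float,
    same_period_earlier: float,
    threshold: float = 0.5,  # assumed: a 50% drop counts as "significant"
) -> bool:
    """Decide whether message volume has dropped enough to raise an alarm.

    current:             messages created per second in the current window
    previous_period:     the same metric in the immediately preceding window
    same_period_earlier: the same metric at the same time in an earlier cycle
                         (for example the same time yesterday)
    """
    # Assumption: alarm only when BOTH comparisons show a significant drop,
    # so a normal nightly dip does not fire just because it is lower than
    # the previous window.
    return (
        drop_ratio(current, previous_period) >= threshold
        and drop_ratio(current, same_period_earlier) >= threshold
    )
```

Requiring both comparisons to agree is one way to keep this layer quiet: a dip that also happened at the same time in the earlier cycle is treated as normal traffic shape rather than a fault.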

System-level monitoring

As I understand it, system-level monitoring falls into two categories:

  1. Automated monitoring, such as the QPS, TP99 and other metrics of RPC interfaces, which the company's service governance platform should provide
  2. Manual alarms added by developers, such as printing an error log when the system hits an error and then alarming on the number of error logs
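As an illustration of the second category, alarming on the number of error logs rather than on every single one could look something like the sketch below. The class name, the sliding-window approach, and the default limits are all assumptions made for the example.

```python
from collections import deque


class ErrorCountAlarm:
    """Fires only when more than `max_errors` errors fall inside a sliding window."""

    def __init__(self, max_errors: int = 10, window_seconds: float = 60.0):
        self.max_errors = max_errors          # assumed limit per window
        self.window_seconds = window_seconds  # assumed window length
        self._timestamps = deque()            # times of recent errors, oldest first

    def record_error(self, now: float) -> bool:
        """Record one error at time `now` (in seconds).

        Returns True if the number of errors in the window exceeds the limit.
        """
        self._timestamps.append(now)
        # Drop errors that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

A caller would invoke `record_error(time.time())` wherever an error log is printed and send an alarm only when it returns True, so an isolated error is logged without paging anyone.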

I think sorting out the system's error logs has to be a long-term process. In the first version of a system it is hard to judge where error logs should be printed; only after going online and finding what is unreasonable can we keep adjusting.

Business-level monitoring

Business-level monitoring involves two main tasks:

  1. Determine business metrics
  2. Ready

Determining quantifiable business metrics is the most important part, for example the organic traffic and revenue of an advertising system.

Origin www.cnblogs.com/xsirfly/p/11536185.html