Monitoring information: push vs. pull

  Anyone who has worked with a number of monitoring systems will notice that most APM products push their messages, while Prometheus pulls (the Pushgateway accepts pushes, but from Prometheus's perspective the data is still pulled). Both are monitoring systems, so why do two different collection models exist?

  Push mode

  Information Type

  Push mode mainly carries events: as soon as an event fires, it is collected and sent immediately. Such information is generally not buffered on the collecting side, because events are unpredictable and you cannot know in advance how much data will be generated. Take HTTP call-chain monitoring: every HTTP request produces a piece of trace data, and since the volume cannot be estimated, neither can the memory it would occupy. The data is therefore sent out as soon as it is produced, so that it does not take up too much memory and affect the monitored process.
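  As a rough illustration, here is a minimal Go sketch of this fire-and-forget push model. The collector endpoint (`http://collector.example/events`) and the event schema are invented for the example; real tracing agents define much richer formats.

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// TraceEvent is a hypothetical call-chain event; real agents use
// richer schemas than this.
type TraceEvent struct {
	TraceID  string        `json:"trace_id"`
	Span     string        `json:"span"`
	Duration time.Duration `json:"duration_ns"`
}

// pushEvent sends the event to the collector as soon as it exists,
// so the monitored process never accumulates an unbounded buffer.
func pushEvent(ev TraceEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	// collector.example is a placeholder endpoint.
	resp, err := http.Post("http://collector.example/events",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// Fire-and-forget: each request's trace event leaves the process
	// immediately instead of piling up in local memory.
	_ = pushEvent(TraceEvent{
		TraceID:  "abc123",
		Span:     "GET /users",
		Duration: 15 * time.Millisecond,
	})
}
```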

  Message reliability

  Because messages are pushed, most of the initiative lies with the sender. When a send fails, the sender can retry; after a certain number of attempts it can put the message back on the send queue (at the risk of blowing up memory), persist it to a local file, and so on.
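  A minimal sketch of that idea with a bounded in-memory resend queue; the channel capacity, retry limit, and the file-persistence fallback are all illustrative choices, not taken from any particular agent.

```go
package main

// message is a hypothetical outgoing monitoring message.
type message struct {
	payload []byte
	retries int
}

const maxRetries = 3

// queue is bounded on purpose: a fixed capacity caps memory use
// instead of letting the resend buffer grow until the process dies.
var queue = make(chan message, 1024)

// enqueue reports false when the queue is full; the caller could then
// persist the message to a local file (or drop it) instead.
func enqueue(m message) bool {
	select {
	case queue <- m:
		return true
	default:
		return false
	}
}

// sender drains the queue and re-enqueues failures up to maxRetries.
func sender(send func([]byte) error) {
	for m := range queue {
		if err := send(m.payload); err != nil && m.retries < maxRetries {
			m.retries++
			enqueue(m)
		}
	}
}

func main() {
	// Stub transport; in a real agent the sender runs for the
	// lifetime of the process.
	go sender(func(b []byte) error { return nil })
	enqueue(message{payload: []byte("cpu=0.42")})
}
```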

  A sender-plus-acknowledgement scheme can easily produce duplicate data. For example, if a message has not been ACKed for a long time after being sent, it is sent again; under network delay or a brief network flap, the retransmission means the receiver ends up with two copies of the same data. Deduplication generally relies on the rule that a process has exactly one state at a given point in time, and some software exploits its own characteristics to get this effect: OpenTSDB, for instance, relies on HBase's file-block split-and-merge (compaction) process to strip the duplicate data out.
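  Here is a toy sketch of that deduplication rule on the receiving side: points are keyed by (series, timestamp), so a retransmitted point simply lands on top of the copy already stored. The schema and names are invented for illustration; this is not how OpenTSDB is implemented.

```go
package main

import "fmt"

// point is one monitoring data point for a series at a timestamp.
type point struct {
	series string
	ts     int64 // unix seconds
	value  float64
}

// store keeps at most one value per (series, timestamp): since a
// process has only one state at a point in time, a duplicate from a
// retransmission just overwrites the identical value already there.
type store struct {
	data map[string]map[int64]float64
}

func (s *store) insert(p point) {
	if s.data[p.series] == nil {
		s.data[p.series] = make(map[int64]float64)
	}
	s.data[p.series][p.ts] = p.value
}

func main() {
	s := &store{data: make(map[string]map[int64]float64)}
	p := point{series: "gc.count", ts: 1717000000, value: 4}
	s.insert(p)
	s.insert(p) // retransmitted duplicate: no second copy appears
	fmt.Println(len(s.data["gc.count"])) // prints 1
}
```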

  Pull mode

  Information Type

  Pull mode is mainly about sampling. What it pulls are mostly statistical values or sampled values. A statistical value is cumulative, such as the GC count and GC time a Java process exposes. Because pulls happen at an interval, you cannot observe the exact state changes in between; you only see the values before and after the interval. For example, if within 2 seconds the GC count goes from 2 to 4, you know 2 GCs occurred, but at which moments inside those 2 seconds they happened is unknowable. A sampled value is something like CPU utilization: each pull gets the utilization at that instant, but the process in between is invisible. Say utilization rises from 20% to 100% within one second, then falls to 50% a second later; with a 2-second interval we only see 20% and then 50%, so from the data's point of view utilization simply rose from 20% to 50%, and the spike in the middle is lost.
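  The GC case can be made concrete with a small sketch; the numbers are the hypothetical ones from the text.

```go
package main

import "fmt"

func main() {
	// Two scrapes of a cumulative GC counter, taken 2s apart.
	prevGC, currGC := int64(2), int64(4)
	interval := 2.0 // seconds

	// The delta says 2 GCs happened somewhere in the window;
	// exactly when inside those 2s, the puller cannot know.
	delta := currGC - prevGC
	fmt.Printf("%d GCs in the last %.0fs (avg %.1f GC/s)\n",
		delta, interval, float64(delta)/interval)
}
```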

  Message reliability

  Given the scenarios above, if the message is a statistical (cumulative) value, reliability is effectively guaranteed: even if one pull times out, the next pull still gets the latest state. Only part of the value's history is missing, and the impact is small, because each later state is an extension of the earlier one.
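  A tiny sketch of why a failed pull is harmless for cumulative values (again with made-up numbers):

```go
package main

import "fmt"

func main() {
	// Cumulative counter scraped at t=0 and t=4; the t=2 scrape failed.
	gcAt0, gcAt4 := int64(2), int64(7)

	// The missed pull costs resolution, not data: the next delta still
	// covers the whole 4s gap, because later states extend earlier ones.
	fmt.Printf("%d GCs over 4s despite the failed scrape\n", gcAt4-gcAt0)
}
```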

  For sampled values, losing one message also does not matter much; it is sampling after all, and only one sample point is lost. A sample reflects a state at an instant, not an accurate trend, and we generally use individual data points. A note on the applicability of sampling: it is only useful for persistent problems. Take CPU utilization again: if usage spikes and comes back down within 2 seconds, it is usually not a problem; but if it is high in sample after sample, something is likely wrong. Each collection just increases the probability of catching the problem, and one missed sample changes that probability very little.
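  Under the simplifying assumption that samples are independent, the probability argument looks like this:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Hypothetical: the problem is visible during 20% of any interval.
	p := 0.2
	// The chance that n independent samples all miss it is (1-p)^n,
	// so the detection probability climbs with every extra sample.
	for _, n := range []int{1, 5, 10, 30} {
		fmt.Printf("%2d samples: %5.1f%% chance of catching it\n",
			n, 100*(1-math.Pow(1-p, float64(n))))
	}
}
```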

  Pull mode does not guarantee data reliability, but that does not affect our analysis of the data.

  Data processing

  In the case of statistics, the raw data is frankly unreadable. In the GC example above, what we want to know is the number of GCs per 2 seconds, not a line chart of the ever-growing total; the raw curve's slope does reflect the growth, but it is not intuitive, so the data needs post-processing.
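  A sketch of that processing step: turn the raw cumulative series into per-interval deltas, which is essentially what Prometheus's `rate()`/`increase()` functions do for counters.

```go
package main

import "fmt"

func main() {
	// Raw cumulative GC counts, one per 2s scrape (made-up data).
	raw := []int64{2, 4, 4, 9, 12}

	// Chart the per-interval increase instead of the raw total.
	for i := 1; i < len(raw); i++ {
		fmt.Printf("interval %d: %d GCs\n", i, raw[i]-raw[i-1])
	}
}
```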

  Real-world scenarios

  In practice we often see sampled values and statistics being pushed as well. This is mostly an architectural consideration: you do not want one part of a process's data to be pushed and another part pulled, as that would easily cause confusion.

  What we have not seen is event data being pulled, mainly for two reasons. One is that the pull interval lets event data pile up, which easily causes memory problems. The other is that event information needs reliability: the collection target itself would have to maintain a large amount of bookkeeping about which events were pulled successfully and which failed, and maintaining that state drives development difficulty straight up.

 
