Monitoring systems, part one: monitoring data collection

As the Internet has developed, the complexity of operations and maintenance has grown exponentially, and so has the complexity of the operations platforms that support it. Against this background, we keep asking how to support stability work as fully as possible while keeping our own systems relatively clean and decoupled. Monitoring platforms are a very big topic, but I still want to say a few words about them here. As the business becomes more complex, a monitoring platform can take many forms, but in general it has three parts: collection, storage, and alerting. This article covers the first part of a monitoring system: data collection. If you think it is poorly written, please don't flame me; you are welcome to do better...

1. What data should the monitoring system collect?

Before asking which data a monitoring system should collect, we should first understand what the data is needed for. A monitoring system may be used to view data in real time, to review historical state, or to raise alarms on anomalies. All of these depend on accurate data. The goal of data collection is to gather enough data to meet these various business needs.

  • Basic data
    Basic data means the basic indicators for observing server status, in categories such as CPU, memory, network, and IO. I will not list them all here.
  • Application data
    Application data refers to the status data of applications running on the server, for example port liveness, process liveness, and process resource consumption. The main use of this data is to add liveness alarms for your own services and to trace the historical resource usage of a process.
  • Business data
    The two types of data above need our attention, but in everyday stability work they are not enough: both can look healthy while the business still has problems. We therefore need explicit indicators that describe how the business is running, and we call these business indicators. A service that is not down is not necessarily a business that is working normally. Logic problems introduced by changes, the impact of incorrect data, and response timeouts caused by too much data cannot be discovered from the liveness of the application. Generally, we judge business status by observing traffic, interface error rate, and interface call latency, which together show the business status from several angles. Depending on the business, there will be many other dimensions of indicators, all of which count as business indicators.
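
To make the three common business indicators concrete, here is a minimal sketch that derives traffic, error rate, and average latency from a window of request records. The `RequestRecord` type, the `summarize` function, and the choice to count HTTP 5xx responses as errors are assumptions made for illustration; they are not part of any particular monitoring system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestRecord:
    """One handled request (hypothetical record shape)."""
    status_code: int
    latency_ms: float


def summarize(records: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Compute traffic, error rate, and average latency for one time window."""
    total = len(records)
    if total == 0:
        return {"qps": 0.0, "error_rate": 0.0, "avg_latency_ms": 0.0}
    # Assumption: a 5xx status code marks a failed call.
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "qps": total / window_seconds,
        "error_rate": errors / total,
        "avg_latency_ms": mean(r.latency_ms for r in records),
    }


window = [
    RequestRecord(200, 10.0),
    RequestRecord(500, 30.0),
    RequestRecord(200, 20.0),
    RequestRecord(200, 20.0),
]
print(summarize(window, window_seconds=2))
```

In practice a percentile latency is often more useful than the average, but the average keeps the sketch short.
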
Data model

To talk about how a monitoring system collects data, we must start with the monitoring data model. Monitoring data is time series data in its purest form, so abstracting a unified data model should be the first step in designing and architecting a monitoring system. Time series data generally has four parts: a metric name (metric), labels (tags), a timestamp, and a value.

The open-falcon data model (shown in the original post's figure) customizes the basic time series model by adding two fields, endpoint and counterType.
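
The data model described above can be sketched as a small Python data class. This is an illustration only: the class name `MetricPoint`, the `to_payload` method, the default `"GAUGE"` value, and the representation of tags as a dictionary are assumptions, and the result is not claimed to be the exact wire format that open-falcon expects.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MetricPoint:
    """One time series sample: metric name, tags, timestamp, value."""
    metric: str
    value: float
    timestamp: int  # Unix time in seconds
    tags: Dict[str, str] = field(default_factory=dict)
    # The two extra fields the text attributes to the open-falcon model.
    endpoint: str = ""
    counter_type: str = "GAUGE"  # assumed default, for illustration

    def to_payload(self) -> dict:
        """Return a JSON-serializable dictionary for this sample."""
        return {
            "endpoint": self.endpoint,
            "metric": self.metric,
            "tags": dict(self.tags),
            "timestamp": self.timestamp,
            "value": self.value,
            "counterType": self.counter_type,
        }


point = MetricPoint(
    metric="cpu.idle",
    value=93.5,
    timestamp=1700000000,
    tags={"core": "0"},
    endpoint="web-01",
)
print(json.dumps(point.to_payload(), sort_keys=True))
```

Agreeing on one such structure early means every collection method can produce the same kind of record, whatever its source.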

##### Collection method

The collection method of a monitoring system varies with the source of the monitoring data. The main methods are default collection, plug-ins, probes (detection), logs, and instrumentation (buried points).

  • Default collection
    Default collection is generally built into the default agent. Basic machine indicators such as CPU, memory, and IO, along with process-monitoring indicators, all fall into this category. Such indicators are generally predefined in the agent, and their number does not grow much.

  • Exploratory collection
    As the name suggests, collection based on external detection (probing) falls into this category: port monitoring, ping monitoring, HTTP monitoring, network monitoring, and so on. Exploratory collection works like a plug-in attached from outside. It is relatively lightweight, is not very intrusive to the system, and shows results quickly after a simple configuration. Apart from network monitoring itself, this type of monitoring has one notable characteristic: it relies heavily on the network, so once the network jitters, false alarms are very likely. Multi-point detection, in which the same target is probed from several locations, can prevent false alarms to a certain extent.

  • Business instrumentation (buried points)
    No one knows a system better than the developers who build it, and no one knows better which core indicators to watch in which circumstances.
    It is therefore essential to standardize business metric collection. We suggest one principle and hold to it firmly (haha): collecting business indicators is part of the code. Stability indicators for the business should be part of development work, and the operability of a module should be one of the project's development standards. To support this, operations needs to provide a stable standard for monitoring coverage: call traffic, returned error rate, interface latency, and so on. Any operations standard without tools and platform support behind it is empty talk, so the monitoring system should also provide instrumentation SDKs in various languages and a platform that makes data collection quick and simple.
    Most development teams have their own development framework. If operations can get closer to the business, it can go one step further and integrate the instrumentation SDK directly into the business line's unified development framework.

  • Log monitoring
    From a stability perspective, log monitoring and business instrumentation (buried points) obtain the same data in most cases; log monitoring is simply more flexible. When a service is closed source, or when the open source components in use cannot be modified quickly, log monitoring can take effect quickly. Log monitoring is generally divided into online log collection and offline log collection.
    1. Online log collection is generally more flexible. With a small amount of configuration, you can filter and compute on the time, content, and volume of logs through various custom operations. However, it is more intrusive: an agent must be installed on the machine, and real-time log analysis consumes CPU, which can affect the stability of online services, so resource limits must be enforced.
    2. Offline log collection gathers the logs in one place and processes them centrally. The advantage is that it uses little CPU on the machine and is not bottlenecked by log volume. On the other hand, it consumes a lot of network resources and is less timely than real-time analysis. Centralized processing of large batches of logs also requires a standardized log format, so it may be less flexible.

  • Plug-in collection
    Basic indicators, instrumentation (buried points), logs, and probes are covered above. What about everything else? If users have their own collection method but need to report the data to the monitoring system, plug-in collection handles that scenario. The monitoring system controls the collection cycle, and the user only needs to implement the logic for a single cycle to complete data collection. Plug-in collection is more flexible and supports user-defined collection methods, for example specific collection commands such as jum. The plug-in only needs to report data in the standard format defined by the monitoring system. Because plug-in collection periodically executes plug-in scripts on the machine, these scripts must be audited: harmful actions and excessive resource consumption can both become hidden dangers to online stability.

  • Custom metric reporting
    In addition to the collection methods above, the monitoring system should also support custom reporting. Users who define their own time series want to use the monitoring system's visualization and alerting for them. A custom data reporting interface can serve this need, whether it is provided by the local agent or by a separate central collector. With custom metric reporting, the monitoring system truly becomes stability infrastructure. The open-falcon collection agent provides such an interface. The drawback of supporting customization is that users' reporting behavior cannot be well regulated: data is sometimes reported incorrectly or used irregularly, so the monitoring system needs to manage dirty data.
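
The multi-point detection idea from the exploratory collection item above can be sketched as follows. The function names `tcp_port_alive` and `should_alarm`, and the rule of alarming only when at least `min_failures` probe locations fail, are assumptions for illustration; a real system would run the probe from several machines and gather their results centrally.

```python
import socket
from typing import Iterable


def tcp_port_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Port liveness probe: True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def should_alarm(probe_results: Iterable[bool], min_failures: int) -> bool:
    """Alarm only if enough independent probe locations saw a failure.

    probe_results holds one boolean per probe location (True = target alive).
    """
    failures = sum(1 for alive in probe_results if not alive)
    return failures >= min_failures


# One location failing out of three is treated as network jitter.
print(should_alarm([True, False, True], min_failures=2))
# Two locations failing out of three raises the alarm.
print(should_alarm([False, False, True], min_failures=2))
```

Requiring agreement between locations trades a little detection speed for fewer false alarms caused by a single bad network path.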

Monitoring system data collection method

Generally, there are two ways to get data to the center: active pull by the central end (Prometheus) and automatic reporting by the client (others).

  • Active pull by the central end (Prometheus)
    Active pull means that the central end periodically pulls data from the collection end on demand. The advantage of this model is that data is pulled only as needed, with no waste. However, the central end bears too much of the load, so in theory it becomes a significant performance bottleneck.
  • Automatic reporting by the client (others)
    Automatic reporting means that all data is treated the same and all of it is reported. Typically a unified agent does the collecting, and designs worth considering include multi-level components or a message queue (MQ) layer. If the volume of monitoring data is expected to be relatively large, the mode of local collection with centralized reporting is still recommended. The data collection of open-falcon is a very good example of client-side automatic reporting.
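
As a rough illustration of the client-side reporting mode, the sketch below buffers points locally and hands them to a sender in batches. The `PushReporter` class and its `send` callable are hypothetical: the transport (an HTTP call to a local agent, a message queue producer, and so on) is injected by the caller and is not specified here.

```python
from typing import Callable, List


class PushReporter:
    """Buffers metric points and reports them in batches (push model)."""

    def __init__(self, send: Callable[[List[dict]], None], batch_size: int = 100):
        self._send = send  # hypothetical transport supplied by the caller
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def report(self, point: dict) -> None:
        """Queue one point and flush once the batch is full."""
        self._buffer.append(point)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far; do nothing if the buffer is empty."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._send(batch)


# Usage with a stand-in sender that just records each batch.
sent_batches: List[List[dict]] = []
reporter = PushReporter(sent_batches.append, batch_size=2)
reporter.report({"metric": "qps", "value": 10})
reporter.report({"metric": "qps", "value": 12})  # fills the batch and triggers a flush
reporter.report({"metric": "qps", "value": 9})
reporter.flush()  # sends the remaining point
print(len(sent_batches))
```

Batching on the client is one way to reduce the number of requests the center has to handle; a production reporter would also flush on a timer and handle send failures.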

Origin blog.csdn.net/qq_31555951/article/details/107618660