Long article | Observability monitoring based on Zabbix

01 Observability and Observability Monitoring

02 Observability monitoring based on Zabbix

03 Exploration of Observability Monitoring

Wang Xiaodong, a veteran with many years in operations and author of "nginx Application and Operation and Maintenance Practice"

This article is compiled from Wang Xiaodong's talk at the 2022 Zabbix Summit. To get the slides, reply "ppt" to the official account.

1. Observability and observability monitoring

I have worked in operations for nearly 20 years and have used Zabbix from version 1.0 through 6.0. Drawing on those years of use and on Zabbix's continuous evolution, I will share how to build observability monitoring on Zabbix.

What is observability? How is observability monitoring implemented? Coming to know something new is a process of building understanding, and that process of understanding, I think, is observability.

In nature, the object of observation is matter. In the cloud-native world, the object of observation is the application running in a microservice architecture.

How do we observe an application the way we observe natural substances? The three common dimensions of observability are the three pillars: metrics, logs, and traces. But when you open a page and interact with it, does it display what you expect? That raises questions of consistency and security. Security has two levels: (1) detecting whether the application is secure, and (2) hardening it. As our knowledge of applications and the technology advance, more observability layers and attributes may emerge. In nature, all matter takes on different states over time. Likewise, for monitoring, every application evolves into different states over time.

Borrowing from the idea of four-dimensional space-time, observability monitoring needs to view the application's many dimensions along the time axis, examining the relevant monitoring indicators at a point in time or over a time interval.

Here are my views on observability monitoring.

01 Objectivity. Monitoring should intrude on the monitored application as little as possible, and data should be observed and collected from the objective perspective of a third party.

02 Systemicity. You cannot look at a single indicator in isolation. Observability started with three dimensions and has kept expanding. If you look from only one angle, you lose sight of the whole and cannot observe every aspect of the data.

03 Relevance. Besides observing each monitoring indicator on its own, you also need to establish the correlations among all indicators and the external correlations between different applications, so that the system can be observed as a whole.

04 Predictability. Everything, applications included, behaves dynamically over time, and observation has to follow that behavior. The purpose of monitoring is to detect risks early and avoid failures.

My understanding of observability monitoring can be summarized in the following three points.

01 In the cloud-native world, the object of observability is the application built on a microservice architecture. Observability is an inherent attribute and capability of that application. What can be observed can be explored continuously, by analogy with the physical and chemical methods we use to understand the material world.

02 Observability monitoring follows four principles: objectivity, systemicity, relevance, and predictability.

03 Observability monitoring must observe and analyze from the perspective of time. How capable it is depends on our understanding of observability and on the technical means available.

2. Observability monitoring practice based on Zabbix

01 Integrating Zabbix and Prometheus effectively.

In practice, monitoring in China usually has to cover architectures that include both virtual machines and cloud-native products such as K8s. Neither Zabbix nor Prometheus alone can meet the monitoring requirements, so monitoring items and alarm methods end up being configured in both tools. Each tool has its own independent technology stack, and the resulting maintenance cost is relatively high.

Each alarm must be assigned a threshold and an alert recipient. Our approach is to let Prometheus do what it is good at, namely exporters and Kubernetes. Zabbix then queries multiple Prometheus endpoints over HTTP through the Prometheus API. Using Zabbix's existing template, preprocessing, and automatic discovery features, the data for the different metrics is pulled from Prometheus and the corresponding monitoring items are created in Zabbix automatically. Thresholds are created the same way, and a unified application identifier associates each item automatically with its alert recipient. The whole process is fully automatic.
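
The reshaping step described above can be sketched in Python. This is only an illustration of the data transformation, not the speaker's template: in Zabbix the same work would normally be done by the template's preprocessing steps. The sample response follows the shape of the Prometheus `/api/v1/query` HTTP API, while the label names, macro names, and the function `prometheus_to_lld` are assumptions made for the example.

```python
import json

def prometheus_to_lld(response, label_macros):
    """Convert a Prometheus instant-query response into discovery rows.

    label_macros maps a Prometheus label name to the discovery macro
    that should carry its value, e.g. {"instance": "{#INSTANCE}"}.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rows = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Missing labels become empty strings so every row has the same keys.
        rows.append({macro: labels.get(label, "")
                     for label, macro in label_macros.items()})
    return rows

# Hypothetical response, shaped like the output of /api/v1/query.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.0.0.1:9100", "app": "shop"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "10.0.0.2:9100", "app": "pay"},
             "value": [1700000000, "1"]},
        ],
    },
}

# Hypothetical mapping: the "app" label plays the role of the unified
# application identifier used to find the alert recipient.
LABEL_MACROS = {"instance": "{#INSTANCE}", "app": "{#APP}"}

lld_rows = prometheus_to_lld(sample_response, LABEL_MACROS)
print(json.dumps(lld_rows, indent=2))
```

Each row becomes one discovered entity, so item and trigger prototypes in a template can create the items and thresholds without manual work.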

Queries use the lowest threshold. For example, if CPU usage has alarm levels at 80% and 90% but the first alarm is raised at 60%, the query only needs to ask for values of 60% and above. Once usage reaches 60%, the data is already in the Zabbix server, and Zabbix's own multi-level thresholds take care of alarm levels and escalation.
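
A minimal sketch of this idea, assuming three hypothetical thresholds and a made-up metric name: the query is filtered only by the lowest threshold, and the level is decided afterwards. In the architecture described here the grading would be done by Zabbix triggers; the Python only shows the logic.

```python
# Hypothetical thresholds (percent CPU usage), highest first.
LEVELS = [(90.0, "High"), (80.0, "Average"), (60.0, "Warning")]

def query_filter(metric):
    """Build a PromQL comparison that keeps only samples at or above the lowest threshold."""
    lowest = min(threshold for threshold, _ in LEVELS)
    return f"{metric} >= {lowest:g}"

def severity(cpu_percent):
    """Return the highest level whose threshold is reached, or None."""
    for threshold, name in LEVELS:
        if cpu_percent >= threshold:
            return name
    return None

print(query_filter("cpu_usage_percent"))
for value in (55.0, 60.0, 85.0, 95.0):
    print(value, severity(value))
```

Filtering at the source keeps the volume of data sent to Zabbix small, while the grading stays in one place.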

We also optimized the Prometheus architecture. Prometheus is not deployed inside the Kubernetes platform but runs independently. We use Prometheus's remote read and remote write features to deploy Prometheus at multiple points across multiple clusters and to aggregate the data in InfluxDB. A read-only Prometheus instance serves the data for display, which solves the display problem.

Zabbix provides unified alarm configuration and unified alarm management. The architecture also embodies continuous monitoring throughout the DevOps process: monitoring covers not only the production environment but also the test environments. These benefits come from redesigning the architecture as a whole.

02 Integrating ELK. Zabbix still cannot handle logs effectively in monitoring scenarios. ELK collects and displays logs better, but the usual ELK deployment puts everything into one large cluster, which is hard to maintain.

Following the principles of agility and minimum practical scale, the large cluster can be split into several small ones. How are the small clusters built, and how is data forwarded to them? A centralized Logstash cluster acts as the router for log collection: logs from the different Kubernetes clusters and hosts are all sent to it, and it distributes them to different ELK clusters according to an identifier or the application name.
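
The routing rule can be sketched as follows. In a real deployment this logic would live in the Logstash pipeline configuration; the Python below only illustrates the decision, and the application names, cluster names, and the `app` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: application name -> target ELK cluster.
ROUTES = {"shop": "elk-a", "pay": "elk-b"}
DEFAULT_CLUSTER = "elk-default"

def route(event):
    """Pick the target cluster from the event's application name."""
    return ROUTES.get(event.get("app", ""), DEFAULT_CLUSTER)

def dispatch(events):
    """Group events into one batch per target cluster."""
    batches = defaultdict(list)
    for event in events:
        batches[route(event)].append(event)
    return dict(batches)

events = [
    {"app": "shop", "message": "order created"},
    {"app": "pay", "message": "payment timeout"},
    {"app": "search", "message": "index rebuilt"},
    {"message": "no application field"},
]
batches = dispatch(events)
for cluster, batch in sorted(batches.items()):
    print(cluster, len(batch))
```

A default target means that logs from a new or unlabelled application are not dropped before a route has been defined for them.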

Start with the smallest possible ELK cluster. As the business grows, expand it or move to a more complex setup, and then add and synchronize multiple ES clusters. There are many ELK alerting solutions, but most need additional technology. Ours only needs the HTTP plug-in enabled in Logstash. The alarm policies are written to Redis. When a log event matches, a Logstash filter script filters it again and passes it to a web server built with Flask, which reshapes the data into the zabbix_sender format and sends it to the Zabbix server.

On the server side, the monitoring template and the preprocessing and automatic discovery rules configured in it create the monitoring items automatically in real time, so nothing has to be created by hand. Alarms are graded by log level and go to the unified alert recipients, so no complicated recipient configuration is needed, and notifications are sent directly by email or DingTalk.

This overall design optimizes the ELK deployment structure at a low cost that suits the way enterprise operations grow. Using Zabbix alarm templates simplifies operations: nothing is configured manually, and the whole flow is automated.
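
The conversion into the zabbix_sender data format can be sketched as below. The `zabbix_sender` input file takes one whitespace-separated `<hostname> <key> <value>` entry per line. The item key `log.alert[...]`, the event fields, and the escaping rules applied to the quoted value are assumptions made for this example, not the speaker's actual code, and the web server part is left out so the sketch runs with the standard library alone.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes, quotes, and newlines."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'

def to_sender_line(event):
    """Format one log alert as '<hostname> <key> <value>' for a zabbix_sender input file.

    Assumes host, app, and level contain no whitespace, commas, or brackets.
    """
    key = f"log.alert[{event['app']},{event['level']}]"  # hypothetical item key
    return f"{event['host']} {key} {quote(event['message'])}"

event = {
    "host": "web01",
    "app": "shop",
    "level": "ERROR",
    "message": 'payment "timeout" after 30s',
}
line = to_sender_line(event)
print(line)
```

Lines like this could be written to a file and passed to `zabbix_sender` with its input-file option, or sent by a client that speaks the sender protocol directly.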

03 Integrating tracing with SkyWalking. SkyWalking is also built on the minimum-build principle. A full SkyWalking cluster in production needs many machines, so we chose the simplest setup: a SkyWalking server plus an ES database server.

In a cluster, SkyWalking's alarm configuration is file-based, which makes it complex and impossible to keep synchronized. In our setup SkyWalking is configured only once, including alarm thresholds, alarm methods, and so on.

SkyWalking can also send its alarms over HTTP. We point that output at an external server built with Flask, which converts SkyWalking's alarm data into a format Zabbix supports and sends it to the Zabbix server. There, the template's preprocessing and automatic discovery create the alarms fully automatically, with no maintenance required. This optimizes the SkyWalking deployment according to the minimum-build principle that best fits the enterprise, and it unifies the management of alert recipients, so there is a single place for recipients and event handling.
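
The conversion step can be sketched as below, again without the web server so that it runs with the standard library only. The alarm fields (`scope`, `name`, `ruleName`, `alarmMessage`, `startTime` in milliseconds) follow the SkyWalking webhook payload as I understand it and should be checked against the version in use. The item key and host name are hypothetical. The packet layout is the Zabbix sender protocol: the `ZBXD` signature, a flags byte, a little-endian length, then the JSON body.

```python
import json
import struct

def alarms_to_items(alarms, zabbix_host):
    """Map SkyWalking alarm messages to Zabbix trapper item values."""
    items = []
    for alarm in alarms:
        items.append({
            "host": zabbix_host,
            # Hypothetical key; assumes scope and name contain no commas or brackets.
            "key": f"skywalking.alarm[{alarm['scope']},{alarm['name']}]",
            "value": f"{alarm['ruleName']}: {alarm['alarmMessage']}",
            "clock": alarm["startTime"] // 1000,  # milliseconds -> seconds
        })
    return items

def build_packet(items):
    """Build a 'sender data' request: header, flags, length, reserved, JSON body."""
    body = json.dumps({"request": "sender data", "data": items}).encode("utf-8")
    return b"ZBXD\x01" + struct.pack("<II", len(body), 0) + body

# Hypothetical webhook payload with a single alarm.
alarms = [{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "shop-api",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Response time of service shop-api is more than 1000ms",
    "startTime": 1560524171000,
}]

items = alarms_to_items(alarms, "skywalking-gateway")
packet = build_packet(items)
print(len(packet), packet[:5])
```

In a deployment the packet would be written to a TCP connection to the Zabbix server's trapper port (10051 by default) and the JSON reply read back. That part is omitted here.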

Any monitoring system has five modules: collection, storage, events, actions, and display. We split these five modules apart and implement them by combining and integrating multiple tools.

01 Lightweight architecture. The overall architecture is designed on the minimum-use principle, with no waste and no design ahead of need, so that the whole architecture is actually used.

02 Modularization. Open source tools expose interfaces, so different tools can be integrated with Zabbix at different stages.

03 Low intrusion. Collect data objectively, as a third party, with as little intrusion as possible. For example, Prometheus is taken out of the Kubernetes cluster and deployed independently.

04 Microservices. Operations tools should themselves be built as microservices that call each other through APIs, with each module handled separately and with event processing and alarm output unified on Zabbix.

05 Agile operations through modular splitting. The modules run in plain Docker, not in a complex Kubernetes cluster. With the data stored locally or on a dedicated storage server, a set of Docker Compose files is enough to rebuild or restore the environment in seconds.

06 Low cost. Operations work must not only get the job done but also keep reducing costs for the enterprise: do not over-design, and follow what is most practical at present.

How does this monitoring architecture cover the functionality and consistency sides of observability? Monitoring is business-driven: what matters is whether the service is actually usable by users once it has been deployed.

For example, the whole service path a user goes through when logging in needs to be monitored. The Zabbix agent supports custom scripts, so we write these checks as Python scripts.

Any exception is sent to Zabbix. This gives business consistency checks and end-to-end detection, rather than monitoring only CPU or some other resource metric. Building on Zabbix's unified event handling, we get unified collection, unified alarming, alarm suppression, and correlation-based noise reduction.

There is no need to spend much effort on back-end data analysis. Zabbix alarms can run user-defined scripts, so you only need to keep recording the relationships between applications in Redis and read them directly. For example, take the relationships from the CMDB, record them, and let a Python script do the rest. On that basis, alarm suppression and correlation-based noise reduction only require continuously tuning features that Zabbix already provides.
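
One way such a script could use the recorded relationships is sketched below. A plain dictionary stands in for the data kept in Redis, and the application names and the suppression rule (keep an alarm only if none of its dependencies is alarming too) are assumptions for illustration, not the speaker's implementation.

```python
# Stand-in for the relationships stored in Redis: app -> apps it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

def dependencies(app, graph):
    """Return all direct and indirect dependencies of an application."""
    seen = set()
    stack = list(graph.get(app, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

def root_alarms(alarming, graph):
    """Suppress alarms on apps whose dependencies are alarming as well."""
    alarming = set(alarming)
    return sorted(app for app in alarming
                  if not (dependencies(app, graph) & alarming))

print(root_alarms(["web", "api", "db"], DEPENDS_ON))
print(root_alarms(["web", "cache"], DEPENDS_ON))
print(root_alarms(["web"], DEPENDS_ON))
```

With this rule a database failure that also breaks the API and the web tier produces one notification instead of three.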

3. Exploration of Observability Monitoring

As noted earlier, observability monitoring needs to observe from the perspective of time. Zabbix provides a very useful feature for this: the problem management page gives a unified, multi-dimensional view over a single time interval.

All application events are recorded in Zabbix, so by choosing a point in time or a time interval you can view an application's problems from multiple angles and dimensions for that moment or period, for example its metric, log, and trace alarms.

The idea behind intelligent monitoring is to collect and store data in a unified way and then put it to use. Trends can be predicted along the time dimension. For CPU usage, for example, a curve is fitted continuously over time in order to predict the likely value at the next point in time.

Data that forms over time can be represented by a function. To borrow from Pythagoras: mathematics rules the universe, and a function such as y = kx + b can describe a great deal of it. Instead of storing the raw data, we could store a single formula. Raw data creates the need for big data storage, so it is worth exploring further whether we really need to keep the raw data, or whether keeping only a function would be enough.
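
The time-based view can be pictured with a small sketch: filter the recorded events by application and time window, then group them by dimension. The event fields (`app`, `source`, `clock`, `name`) and the sample data are hypothetical and are not the Zabbix data model.

```python
from collections import defaultdict

def problems_in_window(events, app, start, end):
    """Group one application's problems in [start, end] by their source dimension."""
    view = defaultdict(list)
    for event in events:
        if event["app"] == app and start <= event["clock"] <= end:
            view[event["source"]].append(event["name"])
    return dict(view)

# Hypothetical problem events; clock is a Unix timestamp in seconds.
events = [
    {"app": "shop", "source": "metrics", "clock": 1000, "name": "CPU usage above 80%"},
    {"app": "shop", "source": "logs", "clock": 1030, "name": "ERROR payment timeout"},
    {"app": "shop", "source": "trace", "clock": 1045, "name": "Slow response on /checkout"},
    {"app": "shop", "source": "metrics", "clock": 5000, "name": "Disk space low"},
    {"app": "pay", "source": "logs", "clock": 1020, "name": "ERROR card declined"},
]

view = problems_in_window(events, "shop", 900, 1100)
for source in sorted(view):
    print(source, view[source])
```

Seeing the three dimensions side by side for the same interval is what makes it possible to relate a slow trace to a CPU spike and an error log.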
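
As a small illustration of representing time-series data by y = kx + b, the sketch below fits a straight line to a few samples with ordinary least squares and extrapolates one step ahead. The sample values are made up, and a straight line is only the simplest possible model. Real CPU usage rarely follows one for long.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = k*x + b; returns (k, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b

def predict(k, b, x):
    """Evaluate the fitted line at x."""
    return k * x + b

# Hypothetical CPU usage (%) sampled at five consecutive time points.
times = [0, 1, 2, 3, 4]
usage = [40.0, 45.0, 50.0, 55.0, 60.0]

k, b = fit_line(times, usage)
print(f"y = {k:.2f}x + {b:.2f}")
print("predicted usage at t=5:", predict(k, b, 5))
```

Storing only k and b instead of the five samples is the trade-off raised above: it saves space, but any detail the line does not capture is lost.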

Applications built on microservice architectures keep evolving and growing, possibly beyond what humans can handle. The movie The Matrix already shows, in anthropomorphic form, scenarios such as application governance, application stress testing, and chaos engineering.

With artificial intelligence, we can build a controllable governance environment that continuously breaks itself on purpose, observes all kinds of observability data in real time, and feeds the results back to drive application iteration. That, I think, is the practical value that application observability brings us. That is all I have to share. Thank you!

Origin blog.csdn.net/Zabbix_China/article/details/131222444