Case Study | Zabbix Monitoring Architecture at a City Commercial Bank

Editor's recommendation:

The city commercial bank where the author works has successfully migrated its application system monitoring to the Zabbix platform. This article shares the experience of building and growing the Zabbix system from the perspectives of architecture deployment, monitoring dimensions, automation solutions, and operation management. The author is also a member of the "Zabbix Technology Exchange Group"; readers are welcome to join the discussion.

AcidGo: infrastructure operation and maintenance engineer at a city commercial bank

In 2018, the Zabbix monitoring system was adopted and popularized within the company. In the internal production environment, Zabbix provides multi-dimensional monitoring coverage of multiple storage devices, thousands of operating system hosts, and various database products; we have also developed a number of custom monitoring scripts, an application monitoring framework, and front-end page displays.

Zabbix Platform Overview
Platform Introduction
Zabbix is an enterprise-grade open source solution that provides distributed system monitoring and network monitoring through a web interface. It can monitor various network parameters, ensure the safe operation of server systems, and provide a flexible notification mechanism that allows system administrators to quickly locate and solve problems. With Zabbix, it is easy to reduce the heavy server management burden on operations staff and ensure the continuous operation of business systems. The backend uses a database to store monitoring configuration and historical data, which can easily feed into data analysis and report customization; the frontend exposes a rich RESTful API for third-party platforms to call. Under the current DevOps trend, the overall architecture is very attractive.

The selection process
We first came into contact with Zabbix in 2017. The main monitoring system we had used before was Nagios, but its page display, monitoring configuration, and automation features were not particularly friendly to infrastructure operations staff, and Zabbix, then very much in the limelight, caught our attention. Infrastructure operations involves many monitoring scenarios, such as PC server fault light inspection, storage array health checks, minicomputer LPAR resource monitoring, and operating system multipath inspection. Zabbix provides built-in monitoring methods such as SNMP, IPMI, SSH, and Agent, which adapt well to each layer of the system architecture; the Agent also supports custom monitoring tools, making it very flexible overall. On the web front end, Zabbix satisfies monitoring management at every granularity, from an entire cluster down to a single monitoring item. The customizable dashboards and historical data visualization also make it much easier for operations staff to review monitoring data. Based on these considerations, we chose Zabbix as a new monitoring platform pilot, starting with basic resources and first moving most of the storage, hosts, and operating systems onto Zabbix.

Current status of use
The Zabbix system, piloted within the infrastructure scope at the end of 2017, has gradually evolved from version 3.2 to the current 4.4, passing several monitoring milestones along the way. It has expanded from a small-scale trial to a much wider range of scenarios covering hardware, applications, platforms, and business, and its architecture has evolved from a single data center to a distributed deployment across three centers. In addition to gradually replacing the old monitoring system, more and more third-party systems have begun to integrate with Zabbix, such as the automated operations platform, the continuous delivery platform, and the operations visualization platform, using the monitoring data to move toward an intelligent operations model.

Not long before writing this article, we successfully completed the migration of application system monitoring to the Zabbix platform. As one of the operations staff who participated in the promotion, implementation, and automation development of the Zabbix system, it is a great honor to witness the robust growth of our operations capabilities. Here I will share our experience of developing and growing the Zabbix system from the perspectives of architecture deployment, monitoring dimensions, automation solutions, and operation management.

Hardware monitoring
In data center operations, the system architecture runs very deep vertically, down to the most basic hardware devices, which traditionally required operations staff to spend considerable time on inspection and troubleshooting. As the number of devices in the data center grows explosively, manual inspection can no longer meet the real-time and reliability requirements of monitoring. For this kind of low-level monitoring, Zabbix's multi-protocol support solves the problem very well: its built-in SNMP/IPMI support can easily connect to the out-of-band management of the relevant hardware devices.

At present, we use the passive SNMP Agent method to regularly inspect the basic indicators of hardware devices, such as fault light signals, power supplies, memory information, and disk arrays; we also collect hardware information from the devices and regularly update it into the CMDB. For example, the following shows the automatically discovered configuration in a Huawei RH2288 V3 iBMC monitoring template:

[Figure: auto-discovered configuration items from a Huawei RH2288 V3 iBMC monitoring template]
Configuring hardware monitoring in Zabbix is also very convenient; most of it is done on the web interface. You only need to define the SNMP Agent/Trap interface or the IPMI sensor target port to flexibly define monitoring items. For IPMI monitoring, it is mainly a matter of filling in the sensor name. At present we use IPMI out-of-band monitoring relatively little; it is mainly used on some Inspur PC servers and for out-of-band management such as DPM in VMware vSphere.
**When choosing a protocol for hardware monitoring, one principle to keep in mind is: if SNMP works, don't use anything else; if SNMPv3 works, don't use SNMPv2.** SNMP enables very flexible automatic discovery in Zabbix, and SNMPv3 provides a more robust authentication mechanism, because network security risks must also be considered while opening up hardware monitoring. The configuration of a single SNMPv3 monitoring item is shown below; most parameters provide input fields:

[Figure: configuration of a single SNMPv3 monitoring item]
The flexibility of SNMP auto-discovery mentioned above stems from the design of SNMP itself: thanks to its tree-structured indexing, the existing elements can be enumerated from an index field, and the next layer of elements can then be traversed based on that enumeration. For this kind of traversal, Zabbix provides the friendly discovery[{#SNMPVALUE},OID] key, which maps seamlessly onto its internal auto-discovery data structure. The principle of the whole SNMP auto-discovery mechanism is as follows:
[Figure: SNMP auto-discovery mechanism]
Since our Zabbix pilot started from infrastructure operations, and Zabbix makes SNMP/IPMI configuration very convenient, we can often filter out exactly the monitoring we need based on the MIB files and MIB documentation provided by the manufacturer. Collecting less reduces the load on the managed system and improves monitoring quality. For example, the following is the SNMPv3 monitoring of a ThinkSystem SR650, customized according to the MIB description of the Lenovo XCC out-of-band management system (http://www.circitor.fr/Mibs/Html/L/LENOVO-XCC-MIB.php):

[Figure: SNMPv3 monitoring of a ThinkSystem SR650 via Lenovo XCC]
The power supplies, arrays, disks, and so on in the figure above are all generated by automatic discovery rules. The same template can be used for XCC out-of-band servers with different numbers of array cards, network cards, and channels; device changes are maintained entirely by Zabbix. To share one piece of experience from customizing SNMP monitoring: first collect all the indicators that need monitoring from the MIB file, group the filtered indicators, and find the OID of the top-level parent index of each group; then use snmpwalk on the Zabbix Proxy to traverse that OID and enumerate all of its content, distinguishing Index from Detail and dividing regular monitoring items from auto-discovery items; finally use snmpget to fetch the OID values one by one to determine the corresponding value type in Zabbix. Note in particular that snmpwalk traverses and does not require a complete OID, while snmpget retrieves based on a complete OID. In Zabbix terms, snmpwalk corresponds to automatic discovery, and snmpget corresponds to regular monitoring items.
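
The walk-then-discover workflow above can be sketched in a few lines of Python: converting snmpwalk-style output into the low-level discovery (LLD) JSON that Zabbix's discovery[{#SNMPVALUE},OID] key produces internally. The OIDs and values here are made up for illustration.

```python
import json

def snmp_walk_to_lld(walk_lines, index_macro="{#SNMPINDEX}", value_macro="{#SNMPVALUE}"):
    """Convert snmpwalk-style output into a Zabbix LLD JSON document.

    Each input line looks like:
        .1.3.6.1.4.1.x.y.z.1 = STRING: "PSU1"
    The trailing OID sub-identifier becomes the index macro and the value
    becomes the value macro, mirroring what Zabbix builds internally.
    """
    entries = []
    for line in walk_lines:
        oid, _, rhs = line.partition(" = ")
        index = oid.rsplit(".", 1)[-1]              # last sub-identifier
        value = rhs.split(": ", 1)[-1].strip('"')   # drop the type prefix
        entries.append({index_macro: index, value_macro: value})
    return json.dumps({"data": entries})

# Example with made-up power-supply OIDs (illustration only):
walk = [
    '.1.3.6.1.4.1.19046.11.1.1.11.2.1.2.1 = STRING: "PSU1"',
    '.1.3.6.1.4.1.19046.11.1.1.11.2.1.2.2 = STRING: "PSU2"',
]
print(snmp_walk_to_lld(walk))
```

Each entry in the resulting `data` array then drives one prototype item, so one template covers servers with any number of power supplies or disks.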

Storage monitoring
In the data center, storage devices are core, critical infrastructure, and any related alarm will put operations staff on alert. In promoting Zabbix's storage monitoring, we ran into a thorny difficulty: storage is not just a hardware device, so the SNMP protocol cannot obtain in-band performance information, yet unlike mainstream operating systems, a Zabbix Agent cannot be installed on it for data collection. Our accumulated experience with this kind of problem is to prefer external interfaces such as RESTful APIs to obtain monitoring data; if that is not supported, monitor from the Zabbix Proxy server through custom monitoring that wraps the tool or method recommended by the manufacturer.

Zabbix Agent allows operations staff to define custom monitoring by encapsulating an executed command as a Zabbix item key for Zabbix to call, and it also supports additional security policies: AllowRoot controls whether the agent may run as root, and UnsafeUserParameters filters special-symbol injection. Our standard for custom configuration, taking the RedHat baseline as an example: each monitoring class is saved as a conf file in the /etc/zabbix/zabbix_agentd.d directory, named ClassA_ClassB_Detail.conf, and the referenced executable files are placed under /usr/local/zbxexec/ClassA/ClassB/xxxx.xx.
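
As a sketch of this convention (the key names and scripts are hypothetical, not our actual files), a conf file such as Storage_Multipath_Detail.conf might contain:

```ini
# /etc/zabbix/zabbix_agentd.d/Storage_Multipath_Detail.conf (hypothetical example)
# Plain key: number of active multipath paths
UserParameter=storage.multipath.count,/usr/local/zbxexec/Storage/Multipath/count_paths.sh
# Flexible key: the bracketed argument is passed to the script as $1
UserParameter=storage.multipath.state[*],/usr/local/zbxexec/Storage/Multipath/path_state.sh $1
```

The ClassA_ClassB naming makes it obvious at a glance which conf file owns which script directory when troubleshooting an item.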

Custom monitoring items make it easy to integrate with the monitoring methods of the various storage vendors, encapsulating the commands each vendor recommends as Zabbix monitoring items. The encapsulated methods are mainly CLI, RESTful, and SSH; for example, the following are the monitoring methods we currently use for each product:
[Figure: monitoring methods used for each storage product]

In addition to working with vendors to integrate with Zabbix, you can also draw on the open source ecosystem that Zabbix actively cultivates. Many companies share their Zabbix experience, templates, and tools on Zabbix Share, which can be used after careful evaluation. At the same time, Zabbix has been working with vendors to publish official monitoring templates, such as the templates for each product line released by Dell EMC (https://www.zabbix.com/integrations/emc).

Through the monitoring methods above, Zabbix's coverage of storage devices in the production environment has left operations staff quite satisfied. The agentless architecture avoids intruding on important devices, while relevant storage alarms are still triggered in time, helping storage administrators quickly discover problems and locate their causes.

Host Monitoring
Our current host monitoring mainly covers Power minicomputers and x86 ESXi. Such objects share a very obvious characteristic: their number and configuration are not fixed. A minicomputer may need new physical or virtual partitions for a newly deployed database, or an adjustment of some database's CPU allocation; a vSphere cluster may add ESXi hosts or resources, or a new cluster may be created. In such a changing environment, the first thing to consider is using Zabbix's automatic discovery to adapt. These scenarios also share another obvious feature: a master control endpoint manages the entire host resource pool. Therefore, the principle we usually adopt for host monitoring is to discover hosts automatically by monitoring the control endpoint, and to let the discovered hosts automatically use the corresponding template.

[Figure: host auto-registration workflow driven by the control endpoint]
The monitoring process above is mainly realized through Zabbix's automatic host registration, which differs from the automatic discovery of monitoring items described under hardware monitoring. Here, a host to be monitored is registered directly from the resource list obtained from the controller; the relevant host configuration, including host name, visible name, and agent interface, is inherited from the controller, and each host is then bound to a pre-configured monitoring template. If the controller finds that a host is no longer in the most recently collected resource list, the host is automatically deleted after the resource retention policy expires. An example of auto-discovered ESXi hosts:

[Figure: auto-discovered ESXi hosts]

Operating system monitoring
Operating system monitoring is very extensive. Besides the variety of operating systems, the monitoring items within each one also cover a wide range; multiplied by the number of physical and virtual machines, the entire monitoring surface becomes very large, and onboarding each server into Zabbix by hand becomes extremely tedious. Our approach is to let servers report themselves to Zabbix through automated means, to optimize templates to reduce duplicate monitoring, and to customize trigger dependencies.

Operating system monitoring is implemented with the Zabbix Agent. Zabbix provides agents for various operating systems that can run directly without compiling. Accordingly, in all our virtual machine baselines, minicomputer backups, and physical machine Ansible deployment scripts, the agent installation and configuration for the corresponding operating system are prepared in advance. We recommend the passive mode, and mainly modify the following parts of the agent configuration:

# ...
# Two of the many Zabbix Proxies
ServerActive=10.10.32.1,10.10.32.2
# 10.10.32.0/24 is the Zabbix Proxy subnet of the current data center
Server=127.0.0.1,10.10.32.0/24
# Hostname is this server's management IP
Hostname=10.10.33.1

This configuration mainly makes it easy to move the agent among multiple Proxies, which keeps the agent continuously effective in scenarios such as failover and Zabbix upgrades. In addition, the local loopback address is written into Server so that local scripts can later call the agent on this operating system. Hostname is not required in passive mode; setting it to the management IP ensures convenience for both active mode and configuration management.

The above is only the agent configuration standard; reporting automatically to Zabbix requires further steps. At present we have implemented automatic host reporting for the x86 operating systems of physical and virtual machines. A report is made at 8:00 a.m. every day, and new hosts are automatically placed in maintenance mode to avoid the alarm storms caused by various non-critical exceptions during the deployment phase; maintenance mode is not exited until the system is stable. For physical machine deployment, in addition to a complete set of automated RAID configuration and PXE system installation, we have an Ansible solution for operating system configuration baselines: the roles of each operating system include an "Install Zabbix Agent And Report" task, so that the host can be added to Zabbix with a standard name by applying the configured vars. For the large number of virtual machines, we wrote a set of Python scripts that scan the virtual machines in each data center's vCenter to obtain daily differences, fill in the business system, application cluster, and server description from CMDB attributes and vCenter annotations, and finally register them in Zabbix. This mechanism not only frees operations staff from registering hosts for new systems, but also lets the scripts enforce management policies to achieve various additional goals, for example:

· Based on network segment information, connect a server's interface to a Proxy in the same data center, avoiding cross-data-center traffic.

· By checking the vps (values per second) of each Proxy in Zabbix, attach new hosts to a Proxy with low load.

· Fill the existing information from the CMDB into the registered host's tags and inventory.
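
The proxy selection and host creation done by those registration scripts can be sketched against the Zabbix API as follows. All field values, the `vps` field, the template ID, and the proxy records are illustrative; the real scripts read them from Zabbix and the CMDB.

```python
import json

def choose_proxy(proxies, subnet_prefix):
    """Pick the least-loaded proxy in the host's own data-center subnet.

    `proxies` is a list of dicts such as
        {"proxyid": "10101", "host": "zbxpxy-dc1-01", "ip": "10.10.32.1", "vps": 350.0}
    (field names are illustrative, not the literal Zabbix API schema).
    """
    local = [p for p in proxies if p["ip"].startswith(subnet_prefix)]
    return min(local, key=lambda p: p["vps"])

def build_host_create(mgmt_ip, proxy, template_ids, cmdb_tags):
    """Assemble a host.create request body for the Zabbix JSON-RPC API."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": mgmt_ip,                   # standard naming: management IP
            "proxy_hostid": proxy["proxyid"],  # same-DC, low-load proxy
            "interfaces": [{"type": 1, "main": 1, "useip": 1,
                            "ip": mgmt_ip, "dns": "", "port": "10050"}],
            "templates": [{"templateid": t} for t in template_ids],
            "tags": [{"tag": k, "value": v} for k, v in cmdb_tags.items()],
        },
        "id": 1,
    }

proxies = [
    {"proxyid": "10101", "host": "zbxpxy-dc1-01", "ip": "10.10.32.1", "vps": 350.0},
    {"proxyid": "10102", "host": "zbxpxy-dc1-02", "ip": "10.10.32.2", "vps": 120.0},
]
proxy = choose_proxy(proxies, "10.10.32.")
req = build_host_create("10.10.33.1", proxy, ["10265"], {"biz_system": "core-banking"})
print(json.dumps(req, indent=2))
```

In production the request body would be POSTed to the Zabbix frontend's api_jsonrpc.php endpoint with an authenticated session.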

The topology of its architecture is simplified as follows:

[Figure: simplified topology of the automatic reporting architecture]

Under this automatic mechanism, operations staff's distaste for monitoring configuration is greatly reduced, and the association with our CMDB lays the architectural foundation for a future ticketing system. But our exploration of the monitoring system did not stop there. Considering the large number of alarms produced by operating system triggers, we also studied additional measures to avoid overly broad alarms.
First, templates are subdivided into base-class templates per category; then, according to the application scenario, upper-level templates are specified as compositions of base classes, avoiding the duplicate monitoring caused by similar functions appearing in too many customized templates. Next, strict dependencies are specified for the triggers in each template to avoid storms of jointly fired alarms. For example, for the partition capacity triggers of the Linux system, we defined dependencies between several watermarks:

[Figure: trigger dependencies between partition capacity watermarks]
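
As a concrete illustration, such a dependency chain can be sketched with Zabbix 4.x trigger expressions (the template name and thresholds here are hypothetical). Since a trigger that depends on another is suppressed while the one it depends on is in problem state, only the most severe watermark actually alarms:

```text
Disaster: {Template_OS_Linux:vfs.fs.size[{#FSNAME},pfree].last()}<5
Average:  {Template_OS_Linux:vfs.fs.size[{#FSNAME},pfree].last()}<10   (depends on: Disaster)
Warning:  {Template_OS_Linux:vfs.fs.size[{#FSNAME},pfree].last()}<20   (depends on: Average)
```

When a filesystem drops below 5% free, the Warning and Average triggers stay silent instead of firing three alarms for one fault.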

Database monitoring
Database monitoring is also a taut string in every operations person's heart. Besides the usual tablespace usage, session counts, SGA usage, ASM usage, cache hit rate, and flush frequency, there are status checks such as downtime and failover. Add to that the distributed databases we have implemented in recent years and the domestic databases we have been trialing, and more and more database products need to be connected to Zabbix. Our database monitoring now combines several monitoring approaches, with indicators defined for each database product, and it has performed excellently both for tracing performance data and for fault alarms.

In the traditional architecture of the financial industry, the Oracle database is an indispensable foundation. Through template customization, we provide templates for architectures such as Single-Instance, RAC, DG, and F5, covering most of the monitoring items that Oracle DBAs care about. A dedicated high-performance Proxy executes the Oracle monitoring scripts via custom monitoring. Beyond emergency fault alarms, Zabbix has become a tool for DBAs to analyze database performance and compare historical data when troubleshooting, thanks to the large amount of monitoring information Zabbix retains. The following are some of the monitoring indicators for database performance and RAC:

[Figure: Oracle performance and RAC monitoring indicators]
Beyond traditionally architected databases, we also provide comprehensive monitoring for other database products, adopting a "root service" (RootService) approach that brings databases into monitoring more automatically. The advantage of this method is that it fully exploits Zabbix's automatic host registration mechanism to add databases into monitoring, and then uses automatic discovery to identify the details of each database that needs monitoring. At present we have written monitoring scripts for OceanBase, SequoiaDB, MySQL, DRDS, and other database products, running self-managed in Zabbix under this new database monitoring architecture. The workflow of the framework is as follows.

[Figure: RootService database monitoring framework workflow]
Taking MySQL monitoring as an example, the figure below describes the whole process in detail.

[Figure: MySQL monitoring process under the RootService framework]
In the MySQL instance configuration file read by zbx_mysql.py, a testenv_zabbix database instance is added; through ACL settings, this file is readable only by the zabbix user. When the host acting as the MySQL RootService performs automatic host discovery, the newly added instance configuration generates Zabbix auto-discovery JSON, a monitored instance is created from the configuration information, and the MySQL base template is additionally applied. The base template contains a series of special auto-discovery rules. For example, "Discovery MySQL Replication Enable" executes show slave status on the discovered instance, again by calling the MySQL RootService script; if the target instance is found to have replication enabled, the JSON returned by auto-discovery will include a {#REPLICATION}: "enabled" field, which causes the replication monitoring items to take effect.
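
The replication discovery described above can be sketched as follows. The instance names are illustrative, and the probe is a stand-in for the real show slave status call made through the RootService script.

```python
import json

def discover_replication(instances, probe):
    """Emit Zabbix LLD JSON flagging instances that have replication enabled.

    `instances` comes from the RootService configuration file; `probe` is a
    callable that runs `show slave status` against one instance and returns
    True if a slave is configured (injected here to keep the sketch
    self-contained -- names are illustrative).
    """
    data = []
    for name in instances:
        entry = {"{#INSTANCE}": name}
        if probe(name):
            entry["{#REPLICATION}"] = "enabled"
        data.append(entry)
    return json.dumps({"data": data})

# Pretend only the test instance answered `show slave status` non-empty:
fake_probe = lambda name: name == "testenv_zabbix"
print(discover_replication(["prod_core", "testenv_zabbix"], fake_probe))
```

Prototype items filtered on {#REPLICATION} then materialize only on the instances where the macro is present, which is exactly how the replication items "take effect" without any manual configuration.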

To summarize the pattern: create a logical host as the master control service, automatically register hosts in a chained, branching propagation mode, and then use the auto-discovery rules in the templates to decide whether additional monitoring configuration is needed. This innovative monitoring method has achieved very good results and makes the monitoring system more intelligent. There is no need to write the connection user and password into Zabbix macros, as some database monitoring does; it is enough to give the configuration file being read minimal permissions via the file system ACL, which improves the access security of the database. Another point: many distributed databases now adopt a control + compute + storage architecture. For example, TiDB's PD handles metadata management, DB handles SQL parsing and computation, and KV handles the underlying key-value storage. For such databases with many partitions and replicas, the most effective monitoring method is to connect directly to the control components: map the starting point of the Zabbix master control service to the database cluster's control plane, layer the architecture level by level, and solidify the monitoring items between components into auto-discovery rules, achieving accurate and effective monitoring coverage. Taking our current OceanBase distributed database monitoring as an example, the monitoring details of OB Zone, OB Tenant, OB Server, and OB Partition are automatically propagated from OB RootService as follows.

[Figure: OceanBase monitoring automatically propagated from OB RootService]

Beyond keeping the monitoring architecture flexible, we are also pursuing deeper monitoring effects. For database monitoring and troubleshooting, DBAs want all related alarms to share the same time point, which makes horizontal comparison easier. Starting from this, the operations team used the idea of "monitoring snapshots" to concentrate as many monitoring items as possible at the same time point, which also greatly reduces the performance loss caused by frequent interactions with the monitored database. This is achieved mainly through custom script conventions and the dependent item feature of Zabbix monitoring items. Taking SequoiaDB monitoring as an example, the fact that a SequoiaDB snapshot operation carries a certain performance cost illustrates exactly why reducing the interaction frequency matters.
[Figure: SequoiaDB monitoring items on the Coord host]

The Coord host of the SequoiaDB cluster in this test environment has 56 monitoring items in total. If every monitoring item separately connected to one of the coordinator nodes and took a snapshot to obtain its indicator, these frequent snapshot operations would impose a very large performance cost on the cluster. In fact there are only 3 real monitoring items here, those beginning with multi_snapshot_SDB_SNAP_* in the screenshot; the rest are derived from these three, which means the number of interactions drops from 56 to 3. Our implementation generates, via custom scripts and LLD macros, a JSON document containing each sub-item, and sets the master item not to keep history. This is important because once the sub-items have been split out and calculated, storing large volumes of redundant JSON strings is meaningless. The sub-items mainly use Zabbix item preprocessing: JSONPath extracts the corresponding key/value, and the final value is then obtained through multipliers, change-per-second, or regular expressions. Although preprocessing consumes CPU on the Zabbix side, CPU usage did not rise significantly after we increased the StartPreprocessors parameter, and Zabbix Proxy is distributed and scalable, so this bottleneck is easy to solve by adding capacity. After a comprehensive evaluation, the returns from this solution are considerable. When a database problem occurs, DBAs who want to look back at the monitoring history can see the indicators on the same time plane, as follows:
[Figure: monitoring history aligned on the same time point]
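
The snapshot-and-split mechanism can be sketched in a few lines: one master item fetches everything in a single interaction, and dependent items extract their values with JSONPath-style preprocessing. The field names below are illustrative, not the real SDB_SNAP_* schema.

```python
import json

# One "snapshot" master item returns everything in a single interaction;
# dependent items then split it via JSONPath preprocessing.
snapshot_json = json.dumps({
    "TotalSelect": 120345,
    "TotalInsert": 20111,
    "connections": {"current": 87, "max": 500},
})

def jsonpath_extract(document, path):
    """Minimal stand-in for Zabbix's JSONPath preprocessing step,
    supporting dotted paths like '$.connections.current'."""
    node = json.loads(document)
    for part in path.lstrip("$.").split("."):
        node = node[part]
    return node

# Three dependent items resolved from one master value -- 1 interaction, not 3:
total_select = jsonpath_extract(snapshot_json, "$.TotalSelect")
current_conn = jsonpath_extract(snapshot_json, "$.connections.current")
max_conn = jsonpath_extract(snapshot_json, "$.connections.max")
print(total_select, current_conn, max_conn)
```

In Zabbix itself, the master item's history is simply disabled, so only the extracted per-indicator values are stored.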

Application monitoring
When monitoring considerations rise to the level of application components, we still insist on automation in the face of dazzling monitoring requirements, thinking about how to give application monitoring more vitality, and also analyzing the pain points we encountered when monitoring applications with Nagios in the past. In the end we wrote a framework-level tool to take over the monitoring life cycle of applications. Internally the framework is called zbx_app. It operates through a file server and an application monitoring convention, and it automatically handles custom script pulls, version iteration, automatic registration of monitoring items, and so on. Operations staff only need to write an application monitoring declaration file; all other work is done by the framework.

Internally, the framework mainly uses specific Zabbix monitoring items and discovery rules as signals to update configuration and automatically check the basic environment. If it finds that a module version on the file server has been updated, it actively pulls the file, achieving self-managed operation. The module that receives these signals and manages its own environment is internally called the base class. Every module's monitoring has an auto-discovery rule that interacts with the base class: if the declaration file contains the requested module, the base class answers, letting Zabbix perceive it and generate the module's monitoring items from the returned result. Once the monitoring items are generated, each uses the snapshot method to capture the monitoring target once, and related dependent items split out each sub-item at the same time point. This calling stage also interacts with the base class, but the base class invokes the fixed interfaces of the synchronized module according to its module name. These interfaces are the development convention for writing such modules, and their purpose is to ensure that the base class can call and parse them smoothly.

[Figure: zbx_app framework architecture]
Limiting consideration to one host and simplifying the process, we get the following timeline; time progresses from top to bottom.

[Figure: zbx_app interaction timeline]

Zabbix Proxy calls the Discovery APP auto-discovery, triggering the host to initialize the current zbx_app base class once. On receiving the signal, the base class scans the environment, mainly collecting the declaration files (lps.cfg) in each directory, pulls the basic environment configuration according to the declaration files, and then returns the relevant information to Zabbix Proxy.

Once Zabbix Proxy knows the base class is in effect, the auto-discovery rules of the other modules follow up with their own detection of the host, each sending its own discovery signal.

If the base class finds a redis module declared in the declaration file, the internal information is assembled into an auto-discovery response, and Zabbix Proxy generates the corresponding monitoring items after sensing it.

The Redis module then sends monitoring snapshot requests to the base class, and the base class calls the redis module already pulled from the HTTPFileServer to execute the monitoring request and return the result set.

If the application maintainer later writes a url module requirement into the declaration file, then the next time the Discovery APP signal arrives, the base class discovers the new declaration and quickly pulls the specified version of the url monitoring module from the HTTPFileServer. URL monitoring then takes effect just as Redis did, until the declaration is deleted or the module is commented out.

If an automation developer updates the redis module's code in the repository, the base class likewise compares the MD5 column after the next Discovery APP signal, finds the version update, and pulls and replaces the module with the latest version.
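
The base class's two core behaviors, answering discovery only for declared modules and detecting module updates by MD5, can be sketched as follows (class, field, and macro names are illustrative, not the real zbx_app code):

```python
import hashlib
import json

class ZbxAppBase:
    """Toy sketch of the zbx_app "base class" loop.

    It answers module discovery only for modules present in the declaration
    file (lps.cfg), and decides whether a module needs re-pulling by
    comparing the MD5 of the local copy against the file server's digest.
    """

    def __init__(self, declarations, local_files):
        self.declarations = declarations  # parsed lps.cfg, e.g. {"redis": {"port": 6379}}
        self.local_files = local_files    # module name -> local file bytes

    def answer_discovery(self, module):
        """Respond to a module's auto-discovery signal, or stay silent."""
        if module not in self.declarations:
            return None  # not declared -> Zabbix generates no items
        macros = {f"{{#{k.upper()}}}": str(v)
                  for k, v in self.declarations[module].items()}
        return json.dumps({"data": [{"{#MODULE}": module, **macros}]})

    def needs_update(self, module, server_md5):
        """True if the file server publishes a different version of the module."""
        local = self.local_files.get(module, b"")
        return hashlib.md5(local).hexdigest() != server_md5

base = ZbxAppBase({"redis": {"port": 6379}}, {"redis": b"v1 code"})
print(base.answer_discovery("redis"))  # declared -> LLD JSON
print(base.answer_discovery("url"))    # not declared -> None
print(base.needs_update("redis", hashlib.md5(b"v2 code").hexdigest()))  # True
```

Adding a url declaration to lps.cfg would make answer_discovery("url") start returning LLD JSON on the next Discovery APP cycle, which is exactly the activation path described in the timeline above.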

The self-managing base class realizes closed-loop management of application monitoring. On the operational side, it subdivides the responsibilities of the automation developers' modules, and it gives application maintainers more freedom to declare the monitoring they need. Moreover, the actions of the base class are themselves covered by a monitoring item, which guards the stability of the base class itself, so it cannot silently stop working without anyone knowing.
Through this framework, we successfully migrated application monitoring from the previous-generation Nagios system to Zabbix, with lower maintenance cost and more stable operation.

Business monitoring
With the support of a powerful application monitoring framework, operations staff also began to pay more attention to upper-level business monitoring; business characteristics are the most direct manifestation of the stable operation of the entire architecture. Zabbix provides database monitoring in which SQL statements are written directly on the web interface; the Proxy/Server loads the corresponding driver through unixODBC to connect to the target database and returns the execution result. At present we use this method to query business status information, transaction success rates, device activation, batch runs, and so on, and then send alarms through Zabbix's alert channels to the technicians across the bank who have subscribed to that business system. Writing SQL directly on the web interface adapts to the variability of business queries: when a subsystem is added to a monitored business, it is only necessary to join an additional table in its monitoring item, without modifying any custom script. This method also reduces the maintenance cost of custom scripts: ODBC provides drivers for many databases, everything underneath is wrapped by the web interface, and there is no need to think about initializing connections, creating cursors, or releasing sessions. Once each business's monitoring data is collected, it flows smoothly into Zabbix dashboards, and a monitoring panel can be created for each business for large-screen display.


In practice, for this convenient monitoring method, we should pay special attention to a few points:

Restrict file-system permissions on the ODBC configuration file odbc.ini so that only the zabbix user can read it, and only a dedicated administrative user can modify it.

Standardize SQL writing rules and disallow statements with high execution cost. This also requires good communication with the DBAs.

Keep the result set as small as possible: ideally a single row, or even a single row and a single column.

Offload data calculations to Zabbix preprocessing instead of pushing heavy computation down to the database.
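To illustrate the last two points, here is a minimal sketch using Python's built-in sqlite3 in place of a real ODBC driver; the table and column names are invented for illustration:

```python
import sqlite3

# Stand-in for an ODBC-connected business database (table and column
# names are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn_log (status TEXT);
    INSERT INTO txn_log VALUES ('OK'), ('OK'), ('OK'), ('FAIL');
""")

# Good: each query returns a single row with a single column.
ok_count = conn.execute(
    "SELECT COUNT(*) FROM txn_log WHERE status = 'OK'"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM txn_log").fetchone()[0]

# The success-rate arithmetic is done outside SQL; in Zabbix this step
# would live in item preprocessing rather than in the query itself.
success_rate = round(100.0 * ok_count / total, 2)
```

Keeping the queries trivial and the arithmetic in preprocessing keeps the load on the business database predictable.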

Viewed along the monitoring dimensions discussed above, on the vertical axis of operations, coverage now runs from hardware monitoring at the bottom to business monitoring at the top, with multiple layers ensuring that the monitoring system can quickly find problems and faults and send accurate alerts. But our thinking about monitoring did not end there. The next question was whether monitoring should move not only along the vertical axis, but also forward along the time axis, toward the "future". For example, instead of waiting for a user to log in and perform a transaction before a failure generates monitorable data, we periodically probe the preconditions of a transaction: even when no customer is transacting, Zabbix checks whether the resources required by the transaction page are available. This guarantees that page abnormalities on some trading platforms are discovered earlier than by real customers, logically achieving a "prediction" effect. In addition, multiple operator lines can be used for page monitoring to cover customer scenarios more comprehensively; when an abnormality is found, comparing against the other lines helps identify operator network problems such as CDN issues or blacklisting.

We use Zabbix Web monitoring, with firewall and NAT providing network isolation, Squid implementing line selection, and automation tools such as Selenium simulating page actions, to monitor the online banking system across internal/external networks and individual operator lines.
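The line-selection and availability-check logic can be roughed out as follows; the proxy address, marker string and status rules are all assumptions, and a real deployment drives Selenium rather than raw HTTP:

```python
import urllib.request

def opener_for_line(proxy_addr: str) -> urllib.request.OpenerDirector:
    """Build an opener routed through a Squid proxy, selecting one
    operator line (the address format is a placeholder)."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_addr, "https": proxy_addr}
    )
    return urllib.request.build_opener(handler)

def page_available(status: int, body: bytes,
                   marker: bytes = b"login-form") -> int:
    """Classify a fetched transaction page: available only if the HTTP
    status is 200 and a required resource marker appears in the body.
    Returns 0/1 so the result maps directly onto a Zabbix item value."""
    if status != 200:
        return 0
    return 1 if marker in body else 0
```

Comparing the 0/1 result across several proxy lines is what lets the same check distinguish a page fault from an operator-line fault.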



Beyond the monitoring discussed above, platform monitoring has a special scenario that operation and maintenance staff find hard to avoid: some systems ship with their own monitoring platform, which the mainstream monitoring system in use can neither integrate with nor replace. The problem becomes more visible as components are introduced and products multiply. For example, the mobile development platform mPaaS ships with monitoring components such as monitorkernel, corewatch and monitorguard, which play a very important role in running the mPaaS platform. Replacing them with another monitoring system would carry a huge technical adaptation cost and sacrifice the stability of the platform itself. So we decided to use self-built external channels plus Zabbix Sender to connect such external platforms to Zabbix as a monitoring stream, avoiding intrusion into the original third-party monitoring system while letting its monitoring data converge into Zabbix.
First, a word on how Zabbix Sender works. When a monitored host's items are in active mode in Zabbix (most of our hosts are passive), the host can construct data for an existing monitoring item and send it in; Zabbix stores it as that host's item data, achieving near real-time monitoring. When the upstream of the docking host is a third-party monitoring platform, the whole process looks like a continuous data stream flowing from upstream into Zabbix. For the upstream third-party platform we currently have the following integration options:

· When the external platform supports HTTP RESTful monitoring integration, connect it to an HTTP server dedicated to receiving such alarms, with an independent resource-path handler for each platform.

· When the external platform does not support monitoring integration but can push alarms, set the accepted alarm level to the lowest (i.e. forward everything) and push to an HTTP server, TCP/UDP server, or mail server.

· When the external platform has no channel for pushing information at all, fall back to methods such as web scraping or database polling, at the cost of losing the streaming character of the data.
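All of these options ultimately feed Zabbix through the sender ("trapper") wire protocol, which is simple enough to sketch; the host and item key below are placeholders:

```python
import json
import struct

def build_sender_packet(host: str, key: str, value: str) -> bytes:
    """Build a Zabbix sender (trapper) protocol packet:
    "ZBXD" + protocol byte 0x01 + 8-byte little-endian body length + JSON."""
    body = json.dumps({
        "request": "sender data",
        "data": [{"host": host, "key": key, "value": value}],
    }).encode("utf-8")
    return b"ZBXD\x01" + struct.pack("<Q", len(body)) + body

# Placeholder host/key: what a corewatch-style alarm push might become.
packet = build_sender_packet("mpaas-gw-01", "corewatch.alert",
                             "disk usage high")
```

In production the packet is written to the Zabbix Server or Proxy trapper port (10051 by default), or one simply shells out to the stock zabbix_sender binary.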

In practice the third situation is almost nonexistent; platforms generally provide HTTP push of monitoring data or alarms, so this type is connected to our dedicated HTTP server written in Golang. Each time a new platform is docked, we just add implementations of the OtherReader and ZBXSender interfaces in the corresponding handler. One caveat: for triggers fed this way, pay attention to dependencies on the change()/diff() trigger functions, because the same monitoring item can be pushed at a very high frequency.
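The OtherReader/ZBXSender pairing might look like this in outline (those two interface names come from the text; everything else, including rendering the Go interfaces in Python, is an assumption):

```python
from typing import Protocol

class OtherReader(Protocol):
    # Parse one upstream HTTP payload into normalized alarm dicts.
    def read(self, payload: bytes) -> list: ...

class ZBXSender(Protocol):
    # Forward one normalized alarm to Zabbix as sender/trapper data.
    def send(self, alarm: dict) -> None: ...

class Handler:
    """One handler per docked platform: read upstream alarms, push each
    of them on to Zabbix, and report how many were handled."""
    def __init__(self, reader: OtherReader, sender: ZBXSender):
        self.reader = reader
        self.sender = sender

    def handle(self, payload: bytes) -> int:
        alarms = self.reader.read(payload)
        for alarm in alarms:
            self.sender.send(alarm)
        return len(alarms)

# Wiring with stand-in implementations (a real reader would parse the
# platform's own payload; a real sender would speak the trapper protocol):
class JSONReader:
    def read(self, payload):
        import json
        return json.loads(payload)

class ListSender:
    def __init__(self):
        self.sent = []
    def send(self, alarm):
        self.sent.append(alarm)

sender = ListSender()
handled = Handler(JSONReader(), sender).handle(
    b'[{"sev": "high"}, {"sev": "low"}]'
)
```

Adding a new platform then means adding one reader class, without touching the server core.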
Taking the mPaaS mentioned above as an example again, docking with its core monitoring platform corewatch in this way captures all of that platform's alarm monitoring.


Alarm notification
The topic of monitoring coverage ends here. The next link in the chain is how to push the alarms that monitoring triggers to the technical staff who need to receive them. Zabbix itself provides a wide variety of alarm push channels, and custom scripts can post-process alarm content before pushing it onward. But we chose to stretch the push pipeline a little further, so that each alarm can do more work.
If alarms cannot be delivered, monitoring loses more than half of its value; but if too many are delivered, alarms lose all value. Zabbix's alarm message is just a string, so stray special characters can break downstream serialization steps; or the Proxy responsible for sending alarms goes down and a backlog of alarms never goes out. Every technician who has been on the receiving end of alarms knows these troubles well. To solve these last-mile problems we keep trying different methods and are still looking for breakthroughs.
We wrote an internal tool named Zabbix Robot. It parses the predefined Zabbix alarm string as JSON, and after preprocessing decides whether the alarm needs to be suppressed, or is jitter that needs to be converged, before pushing the alarm downstream. In addition, in the Zabbix architecture design, the Proxy and the Server responsible for the main alarm push monitor each other, and the Server has two push channels so it can still notify when the main channel fails, avoiding alarm blackouts.

The SMS modem here uses gnokii to drive a serial interface for sending text messages. Its sending efficiency is very low, so it is used only when a major fault is detected in the Zabbix architecture itself, notifying the few operation and maintenance staff responsible for the monitoring system. Both the backup channel and the main channel deliver alarm information to Zabbix Robot as curl POSTs; the difference is that they use different Headers, which Zabbix Robot uses to distinguish and filter the channels. Zabbix Robot is an HTTP server we developed in Golang. It currently has two modules: the suppression module fires first, the convergence module after. Suppression takes effect per LimitUnitGroup: when the current alarms satisfy a group's triplet ({flag, number, second}), it spawns an InhibitionUnit goroutine keyed by the flag; if an InhibitionUnit already exists for that flag, the alarm is judged to be in the suppressed state and is not sent out.
An InhibitionUnit is likewise a triplet ({flag, number, second}): once alarms are matched to a flag, suppression is exited after number further alarms are received or second seconds have elapsed. In Zabbix Robot's configuration file, each flag and its attributes are specified separately. In practice the most commonly used flag is Zabbix's TRIGGER.SEVERITY, i.e. the alarm level.
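A minimal sketch of the InhibitionUnit semantics described above (the {flag, number, second} fields come from the text; the exact boundary behavior is an assumption):

```python
import time

class InhibitionUnit:
    """Suppress alarms for one flag until more than `number` alarms
    arrive or `second` seconds elapse since suppression started."""
    def __init__(self, flag: str, number: int, second: float,
                 now: float = None):
        self.flag = flag
        self.number = number
        self.second = second
        self.count = 0
        self.started = time.time() if now is None else now

    def suppress(self, now: float = None) -> bool:
        """Return True if this alarm is swallowed, False once either
        exit condition (count or elapsed time) has been reached."""
        now = time.time() if now is None else now
        self.count += 1
        if self.count > self.number or now - self.started > self.second:
            return False
        return True

# Exit by count: the third alarm breaks out of suppression.
u = InhibitionUnit("Disaster", number=2, second=60, now=0.0)
results = [u.suppress(now=t) for t in (1.0, 2.0, 3.0)]

# Exit by time: one alarm after the window also breaks out.
u2 = InhibitionUnit("High", number=100, second=60, now=0.0)
late = u2.suppress(now=61.0)
```

The `now` parameter exists only to make the sketch deterministic; a real unit would use wall-clock time.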
The convergence module is a function currently in the testing phase. It implements the Weights method first, which is to judge the hostid X+itemid in the samples received from the alarm and the past 30 minutes (adjustable)Y+triggerid*Z total score, if it exceeds the predetermined value K, it will be judged as jitter and then converged. X/Y/Z here is a customizable weight. For example, you can increase the X to make the convergence weight lean towards the host, and then the number of alarms sent by the same host will be converged first. In addition, there are other convergent methods, such as string features and machine learning, which are also the directions we are trying.
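The Weights scoring can be sketched as follows (the X/Y/Z/K values are arbitrary illustrations):

```python
from collections import Counter

def is_jitter(window, alarm, X=1.0, Y=2.0, Z=3.0, K=10.0):
    """Score one alarm against the recent window of alarms:
    count(same hostid)*X + count(same itemid)*Y + count(same triggerid)*Z.
    A score above K marks the alarm as jitter to be converged."""
    hosts = Counter(a["hostid"] for a in window)
    items = Counter(a["itemid"] for a in window)
    trigs = Counter(a["triggerid"] for a in window)
    score = (hosts[alarm["hostid"]] * X
             + items[alarm["itemid"]] * Y
             + trigs[alarm["triggerid"]] * Z)
    return score > K

# Three identical alarms in the last 30 minutes: the fourth scores
# 3*1 + 3*2 + 3*3 = 18 > 10 and is converged as jitter.
window = [{"hostid": 1, "itemid": 10, "triggerid": 100}] * 3
repeat = is_jitter(window, {"hostid": 1, "itemid": 10, "triggerid": 100})
fresh = is_jitter(window, {"hostid": 9, "itemid": 9, "triggerid": 9})
```

Raising X relative to Y/Z makes the score, and hence convergence, lean toward the host dimension, as the text describes.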
After suppression and convergence, the alarm flows to our automation platform, which decides whether to open a work order, then delivers the alarm through the subscription system to the designated owners, who report handling status back in the work-order system. The subscription system is also wired to some Zabbix metadata: the tag feature introduced in Zabbix 4.0 makes alarm filtering very convenient. If you have used Kubernetes, you can understand this filtering as Labels and Selectors.
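That Labels/Selectors-style filtering can be pictured as a subset match on tags (the tag names here are invented):

```python
def matches(alarm_tags: dict, selector: dict) -> bool:
    """An alarm matches a subscription if every selector tag is present
    on the alarm with the same value (subset match, as with Kubernetes
    label selectors); an empty selector matches everything."""
    return all(alarm_tags.get(k) == v for k, v in selector.items())

# Invented tags on one alarm, checked against two subscriptions:
alarm = {"system": "netbank", "severity": "high", "dc": "primary"}
netbank_team = matches(alarm, {"system": "netbank"})
mpaas_team = matches(alarm, {"system": "mpaas"})
```

Each subscriber's selector is then just a small dict, and routing is a filter over active alarms.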

The current alarm push process solves most of the original problems and enables Zabbix to provide strong support for the work order system and subscription system. In the future, even dual nodes can be deployed to achieve high availability.

Report generation
In order to allow Zabbix to have more report display capabilities, we have also customized the front end to connect some commonly used reports to Zabbix. At the same time, Zabbix also provides asset list, custom topology, dashboard and other functions, and has certain report generation capabilities.
The page monitoring mentioned above actually forms a flow topology across the internal and external networks, linking pages at every level end to end, which provides excellent problem-localization capability.


In addition, there are daily-updated capacity lists, hardware inventories, inspection reports, and so on. Take the VMware virtual machine mutual-exclusion check as an example: it shows each business system's application modules as a tree and flags redundant nodes that have been deployed onto the wrong underlying resource (the same ESXi host, LUN, or cluster). This makes it easy for the virtualization administrator to separate associated virtual machines, limiting the system modules affected when HA fires and ensuring the same-city disaster recovery design meets expectations.


High availability
Prior to 5.0 there was no official Zabbix high-availability solution, so we use a database-level recovery scheme. A scheduled script runs every morning (staggered away from Housekeeping as much as possible) to back up the Zabbix tables excluding history*, plus the Zabbix web front-end files, to the disaster recovery data center. In the event of an unrecoverable failure, Zabbix Server can be redeployed and the database restored, at the cost of losing history and trends, but quickly restoring monitoring operations.
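The nightly dump can be sketched as a mysqldump invocation built around --ignore-table; the exact table list is illustrative (these history*/trends* tables hold the historical data that the scheme sacrifices):

```python
# The bulky time-series tables to skip; a real script would query
# information_schema to enumerate them rather than hard-coding.
HISTORY_TABLES = [
    "history", "history_uint", "history_str", "history_log",
    "history_text", "trends", "trends_uint",
]

def backup_command(db: str = "zabbix") -> list:
    """Build a mysqldump command for the Zabbix schema that excludes
    the history*/trends* tables, keeping only configuration data."""
    cmd = ["mysqldump", "--single-transaction", db]
    for table in HISTORY_TABLES:
        cmd.append("--ignore-table={}.{}".format(db, table))
    return cmd

cmd = backup_command()
```

The resulting list can be handed to subprocess.run and the output shipped to the disaster recovery site alongside the web front-end files.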
In addition, we recommend configuring the Zabbix Agent's passive-mode Server address as the Proxy's network segment, so that when a Proxy fails its hosts can quickly be moved to another Proxy.

Future planning
In the two years since we adopted Zabbix, operations has moved toward full automation, and in that context we increasingly see the value of monitoring: it is not just an alarm broadcaster, and its potential needs to be tapped in more intelligent ways. Over these two years we have extended the Zabbix system in many directions, and we look forward to more development possibilities. Our plans for the future come down to the following points.

Zabbix database selection
The Zabbix database architecture currently used in production is Zabbix 4.4 + Percona 8.0 + TokuDB. TokuDB is used mainly for the history* tables, which are partitioned, while the other configuration tables remain InnoDB; TokuDB uses the QUICKLZ compression algorithm. The database configuration has also been tuned: durability settings with a performance cost, such as "double 1" (sync_binlog=1 and innodb_flush_log_at_trx_commit=1), were relaxed in favor of performance. When we first migrated to this architecture, the compression ratio and QPS for historical data improved noticeably, but as the number of monitoring items grew we gradually hit bottlenecks and pressure. Even compressed, historical data keeps trending upward; the CPU iowait caused by deleting expired data each time the housekeeper runs is irritating; and the long load times when querying large amounts of cold historical data are painful.
The Zabbix platform in our test environment manages several times the number of hosts in production, which makes it a realistic stress-test scenario. In search of best practice we tested TokuDB, RocksDB, TiDB, Elasticsearch, TimescaleDB and other database products there. We even produced a lightly modified build of Zabbix 4.8 compatible with TiDB 3.0 (https://github.com/AcidGo/zabbix_tidb), but unfortunately TiDB's foreign-key support did not meet expectations; it may still be a good choice for an archived history library. Testing showed that Zabbix 5.0 + PostgreSQL 12 with TimescaleDB gives the most satisfactory result: as a time-series solution it fits the append-and-range-scan access pattern of the history* tables very well, supports compression beyond a specified number of days, and is overall clearly better than the current TokuDB solution. We will also consider using a separate database for archived history with a higher compression ratio, plus a read-only Zabbix Server deployment for access, to separate out warm data.

Oracle database monitoring transformation
Unlike MySQL and other lightweight databases, Oracle connections are expensive: a heavyweight external driver, path selection under an F5 architecture, and a huge number of monitoring items mean checks often queue up in the task backlog. Application monitoring has already been transformed, and monitoring of MySQL, OceanBase, SequoiaDB and others has been replaced with lighter, more automated solutions, so a transformation of Oracle database monitoring is planned as well. The first step is to write connection-pool middleware that manages each database's sessions; the Zabbix Agent's database checks will call the middleware over RPC and return the result, cutting the overhead of session creation and destruction. The second is to plan queries that fetch as much batch data as possible, for example retrieving the current usage of all tablespaces in one pass, and relying on Zabbix dependent items and preprocessing to derive multiple monitoring values from a single point in time, reducing the frequency of RPC calls.
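The session-pooling idea can be sketched with a generic queue; the connect callable stands in for a real Oracle driver, which this sketch deliberately avoids:

```python
import queue

class SessionPool:
    """Keep pre-built database sessions warm so each Zabbix check
    reuses one instead of paying the Oracle connect/teardown cost."""
    def __init__(self, connect, size: int = 4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def query(self, run):
        conn = self._pool.get()        # borrow a warm session
        try:
            return run(conn)
        finally:
            self._pool.put(conn)       # return it for the next RPC call

# Usage with a stand-in connection factory (a real pool would wrap the
# Oracle driver and expose this over RPC to the Zabbix Agent):
pool = SessionPool(connect=lambda: {"session": object()}, size=2)
result = pool.query(lambda conn: "ok")
```

Batching then happens inside `run`: one query returning all tablespaces, split into individual item values by Zabbix dependent-item preprocessing.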

Alarm self-healing
At present, our automation has achieved remarkable results. The automation tool library provides a variety of solutions for processing operations. We expect that in the future, low-level alarms can use this powerful automation capability to achieve an automatic recovery mechanism.

Automatic discovery of nested hosts
In some chained architecture scenarios we would very much like Zabbix's host auto-discovery to keep branching down through the architecture, deriving new hosts or host groups at each layer, for example a Domain -> Database -> CollectionSpace -> Collection discovery chain in SequoiaDB. In actual testing, however, host auto-discovery can only be triggered once: the next layer's discovery loses the selected template. The issue is also reported on the Zabbix forum (https://www.zabbix.com/forum/zabbix-help/404802-how-to-nesting-automatically-discovers-hosts-did-i-encounter-a-bug). We look forward to a flexible solution that can create host monitoring at every level by walking the architectural context this way.
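For context, each discovery pass at one level of such a chain emits a standard low-level-discovery JSON document; the macro names for the SequoiaDB levels below are invented:

```python
import json

def lld_payload(rows) -> str:
    """Render discovered child objects as a Zabbix LLD JSON document."""
    return json.dumps({"data": rows})

# One discovery pass at the Domain -> Database level (names invented);
# the nesting problem is that hosts created from these rows do not
# trigger a further discovery pass of their own with a template attached.
payload = lld_payload([
    {"{#DOMAIN}": "dom1", "{#DATABASE}": "paydb"},
    {"{#DOMAIN}": "dom1", "{#DATABASE}": "riskdb"},
])
doc = json.loads(payload)
```

Chained discovery would need each generated host to run such a script again one level deeper, which is exactly the step that currently loses the template.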

Container Platform and Distributed Application Monitoring
Container development has leapt forward in recent years, and orchestration platforms keep emerging. From Swarm to Kubernetes, container monitoring has evolved step by step. Zabbix supports Prometheus data collection from 4.0 onward, a firm step into container monitoring, and combined with LLD macros it can flexibly consume JSON-format discovery data. In the future the two can be combined to achieve automated, container-level intelligent monitoring.
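A toy sketch of bridging Prometheus exposition text into LLD-style rows (the metric and label names are invented; Zabbix's built-in Prometheus preprocessing does this for real):

```python
import re

# Simplified exposition format: metric{pod="..."} value
LINE = re.compile(r'^(\w+)\{pod="([^"]+)"\}\s+([\d.]+)$')

def prometheus_to_rows(text: str):
    """Parse a (much simplified) Prometheus exposition into rows usable
    as LLD macros plus a value per discovered pod."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append({"{#POD}": m.group(2), "value": float(m.group(3))})
    return rows

rows = prometheus_to_rows(
    'container_cpu_usage{pod="web-1"} 0.25\n'
    'container_cpu_usage{pod="web-2"} 0.50\n'
)
```

In Zabbix itself this would be a master item scraping the /metrics endpoint, with Prometheus-pattern preprocessing and LLD macros taking the place of the regex.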

The use of agent2
From our experience writing other operations tools, Golang is well suited to building operations binaries: strong concurrency, the stability and reliability of a static language, and a low learning curve all raise tool quality markedly. Zabbix's recently released agent2 is also developed in Golang and provides interface specifications so that operations staff can customize the Zabbix Agent that suits them best. We are exploring agent2 in the test environment and gradually learning its development process, and expect to roll out the powerful agent2 when the production Zabbix is upgraded to 5.0 in the near future.

The rollout of the Zabbix system, the development of supporting tools, the refinement of monitoring dependencies, and the migration off the old monitoring system are the accumulated achievements of our operation and maintenance team over the past two years. I sincerely thank every one of them for their efforts, and hold firm confidence in realizing the monitoring vision planned for the future.


Origin blog.csdn.net/Zabbix_China/article/details/131323119