Case | Jiujiang Bank Zabbix monitoring system practice

Zabbix monitoring platform construction process

This account of Jiujiang Bank's Zabbix monitoring practice is divided into three parts:

1. The construction process of Zabbix monitoring platform

2. Zabbix practical experience sharing

3. Prospects for future monitoring.

Background of the project

The goal was to establish a new integrated basic monitoring platform that meets the needs of digital transformation, new technology development, and business continuity, raises the level of operation and maintenance automation and intelligence, and improves the level and capability of safe production and operation.

The original monitoring platform had four pain points:

1. Some functions were weak

2. Lack of automation

3. Poor adaptability to new technologies

4. Insufficient vendor support.

Selection of the basic monitoring platform: weighing factors such as independent controllability, schedule, cost, and operational risk against the actual situation of our bank, we chose an open-source basic monitoring platform, supplemented by staffing services or a project-based engagement, to reduce risk during implementation and operation.

After comparing the options, we chose Zabbix as the basic monitoring platform. Its advantages are as follows:

Zabbix can fully meet the demand for basic monitoring products;

Zabbix's functionality, architecture system, and service system are comparable to commercial basic monitoring software;

Zabbix has a complete service support channel and training support system in China;

For Zabbix solutions, there is no difference between the community edition and the paid edition.

The construction of the Zabbix monitoring platform is divided into four stages: research, construction, deepening, and expansion.

In the construction stage, the Zabbix monitoring platform, the unified event platform, and the intelligent analysis platform were built;

In the deepening stage, the platforms were integrated with one another and the basic functions of the platform were improved. The migration from the old basic monitoring platform to the new one has now been completed;

In the expansion stage, we are gradually exploring automated operation and maintenance scenarios such as intelligent analysis, system root cause analysis, and fault self-healing.

Benefits of the basic monitoring platform project

1. Achieved full coverage of basic monitoring indicators;

2. Achieved standardization, normalization, and automation of basic monitoring;

3. Improved operation and maintenance efficiency and freed up operation and maintenance productivity;

4. Improved the level of digital operation and security operation.

Zabbix practical experience sharing

We start with the architecture design of the Zabbix monitoring platform. Overall, we adopt a distributed server-proxy architecture in which monitored objects connect to multiple proxies. The server runs as a single active-passive group, and the database runs as two groups with data replication. Failover of the server and the database is handled by a table-level component, so switching is seamless. Between Zabbix and other systems, connections are made in several ways, including real-time export, the API, and SQL statements.

Zabbix currently monitors operating systems, databases, middleware, and hardware devices, so coverage is very comprehensive. Operating systems, databases, and applications are all monitored with agent custom scripts. Middleware is monitored with JMX and agent custom scripts. VMware uses the monitoring template that ships with Zabbix. Servers and network devices are monitored over SNMP, and storage devices use a variety of protocols, including SNMP, SNMP traps, SSH, and REST APIs.

The production monitoring estate currently comprises more than 9,500 hosts, 580,000 monitoring items, 360,000 triggers, 98 Proxy templates, and 64 users. With this many hosts and items, the actual NVPS (new values per second) in production is 1,800. Scaled to ten times that NVPS, or 18,000, a single server would be monitoring on the order of 90,000 hosts, which is a very large scale.
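
The relationship between these figures can be checked with a little arithmetic. The sketch below assumes the common approximation that NVPS is roughly the number of items divided by their average collection interval; the input figures are the ones quoted above, and the derived interval is only an estimate, not a measured value.

```python
# Back-of-the-envelope check of the scale figures quoted above.
# Assumption: NVPS ~= number of items / average collection interval.

ITEMS = 580_000  # monitoring items (from the text)
NVPS = 1_800     # new values per second observed in production (from the text)


def average_interval_seconds(items: int, nvps: float) -> float:
    """Average number of seconds between two collections of the same item."""
    return items / nvps


def scaled(value: float, factor: float) -> float:
    """Scale a figure linearly, for example to ten times the current load."""
    return value * factor


if __name__ == "__main__":
    print(f"average interval: {average_interval_seconds(ITEMS, NVPS):.0f} s")
    print(f"NVPS at ten times the load: {scaled(NVPS, 10):,.0f}")
```

Under that assumption, each item is collected on average about once every five minutes, which is why the NVPS stays modest despite the large item count.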

Zabbix practical experience sharing - production failure case

Optical modules on production servers often fail, and sometimes a restart does not solve the problem. There are two situations:

1. Complete damage, which means loss of link redundancy and carries a certain operational risk;

2. Timeouts, which cause the business to run slowly.

The solution is to monitor the optical link between host and storage through the optical fiber switch. A collection program is deployed on a dedicated management machine for the optical fiber switches. By associating the relevant templates and monitoring items, the management machine can automatically discover hosts, and it sends the collected monitoring data to the server through the Zabbix Sender tool for monitoring and alarming.
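
As an illustration of the last step, the sketch below formats collected readings as input for `zabbix_sender -i`, whose input file holds one whitespace-delimited `<host> <key> <value>` entry per line. The host names, item keys, and server address are hypothetical; the article does not show the bank's actual collection program, and the target items are assumed to be Zabbix trapper items.

```python
# Sketch: turn collected optical-link readings into zabbix_sender input.
# Host names, item keys, and the server address are hypothetical examples.
from typing import Iterable, List, NamedTuple


class Reading(NamedTuple):
    host: str   # host name as configured in Zabbix
    key: str    # key of a trapper item on that host
    value: str  # collected value


def quote(field: str) -> str:
    """Double-quote a field that is empty or contains whitespace, quotes, or backslashes."""
    if field == "" or any(c in field for c in ' \t"\\'):
        escaped = field.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return field


def to_sender_lines(readings: Iterable[Reading]) -> List[str]:
    """Format readings as '<host> <key> <value>' lines for 'zabbix_sender -i'."""
    return [f"{quote(r.host)} {quote(r.key)} {quote(r.value)}" for r in readings]


def sender_command(server: str, input_file: str) -> List[str]:
    """Build the zabbix_sender command line; it is not executed here."""
    return ["zabbix_sender", "-z", server, "-i", input_file]


if __name__ == "__main__":
    sample = [
        Reading("db-host-01", "fc.port.rx_power[fc1/1]", "-3.2"),
        Reading("db-host-01", "fc.port.state[fc1/1]", "link down"),
    ]
    print("\n".join(to_sender_lines(sample)))
    print(" ".join(sender_command("zabbix.example.com", "/tmp/fc_readings.txt")))
```

Batching readings into one input file keeps the number of connections to the server low when a switch has many ports.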

Zabbix is a good tool that you can use flexibly in day-to-day monitoring.

Zabbix is connected to the unified event platform and the intelligent analysis platform to provide unified alarm management, trend prediction, and update analysis. In the architecture, collected data is written out by the real-time export function of Zabbix Server, pushed to ES (Elasticsearch) in real time through Filebeat, and then consumed by the unified event platform and the intelligent analysis platform. That is the overall process.
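
Real-time export produces newline-delimited JSON, so a downstream consumer mainly has to flatten each record before indexing it. The following is a minimal sketch of that step, not the bank's pipeline: the output field names are hypothetical, and the input fields (`host`, `groups`, `itemid`, `name`, `clock`, `value`) should be checked against the export protocol documentation for the Zabbix version in use.

```python
# Sketch: flatten one line of Zabbix real-time history export (NDJSON)
# into a flat document. Output field names are illustrative assumptions.
import json
from datetime import datetime, timezone
from typing import Any, Dict


def flatten_history_record(line: str) -> Dict[str, Any]:
    """Parse one exported history line and return a flat dictionary."""
    record = json.loads(line)
    host = record.get("host")
    # Depending on the Zabbix version, "host" may be a plain string or an
    # object with "host" and "name" keys; accept both forms.
    if isinstance(host, dict):
        host = host.get("host")
    timestamp = datetime.fromtimestamp(record["clock"], tz=timezone.utc)
    return {
        "@timestamp": timestamp.isoformat(),
        "host": host,
        "itemid": record["itemid"],
        "item_name": record.get("name"),
        "value": record["value"],
        "groups": record.get("groups", []),
    }


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real system.
    sample = (
        '{"host": {"host": "app01", "name": "app01"}, "groups": ["Linux servers"], '
        '"itemid": 10001, "name": "CPU utilization", "clock": 1700000000, '
        '"ns": 0, "value": 12.5}'
    )
    print(flatten_history_record(sample))
```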

Capacity analysis

Monitoring data is first collected and stored in the database; capacity data is then extracted periodically with SQL statements and written to a historical database. For example, Zabbix displays real-time information about a virtual machine. From that perspective the historical database is necessary, because it keeps data for a longer period and enables better subsequent analysis. There are currently two scenarios for capacity analysis:

1. Every day, identify capacity that is about to exceed its threshold and send email reminders so that cleanup or expansion happens in advance. This way, you are not woken up by an on-call alert;

2. Capacity analysis reports. These answer questions that troubled us for a long time, such as how many servers and how much storage to buy next year.
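
The first scenario can be sketched as a simple trend projection. The article does not say which method the bank uses, so the example below assumes one usage sample per day and a straight-line (least-squares) trend; the function names and the sample data are hypothetical.

```python
# Sketch: estimate how many days remain before usage crosses a threshold.
# Assumption: one sample per day and a linear trend; the source does not
# describe the actual forecasting method.
from typing import Optional, Sequence, Tuple


def linear_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Return the least-squares (slope, intercept) for daily samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Days from the last sample until the trend reaches the threshold.

    Returns 0.0 if the last sample is already at or above the threshold and
    None if usage is flat or shrinking.
    """
    if values[-1] >= threshold:
        return 0.0
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(values) - 1))


if __name__ == "__main__":
    usage_percent = [70.0, 72.0, 74.0, 76.0]  # hypothetical daily samples
    print(days_until_threshold(usage_percent, threshold=90.0))
```

A daily job could run this over every file system and email only those whose remaining days fall below a chosen lead time.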

Automatic expansion of file systems and tablespaces: fault self-healing. The automatic expansion mechanism relies on Zabbix's fault detection and action execution capabilities. When usage of an operating system file system, the ASM disk space of a database, or an Oracle tablespace reaches a certain threshold, a script action is triggered. The script connects to the storage device, the VMware platform, and the operating system over SSH to allocate storage, identify the disk, and add it, and finally expands the file system or tablespace. In production, ASM expansion currently completes in about 30 seconds and tablespace and file system expansion within 10 seconds, an efficiency improvement of at least 10 times.
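
One decision such a script has to make is how much space to add. The sketch below is purely illustrative: the trigger percentage, target percentage, and allocation step are hypothetical parameters, not values from the article, and the actual storage, VMware, and operating system commands are omitted.

```python
# Sketch: decide how much capacity to add when a usage threshold is hit.
# All thresholds and step sizes are hypothetical parameters.
import math


def expansion_gb(size_gb: int, used_gb: int, trigger_pct: int = 90,
                 target_pct: int = 80, step_gb: int = 10) -> int:
    """Return the number of GB to add, or 0 if usage is below the trigger.

    The result is the smallest multiple of step_gb that brings usage down
    to target_pct or lower.
    """
    if used_gb * 100 < trigger_pct * size_gb:
        return 0
    needed_size = used_gb * 100 / target_pct
    missing = max(0.0, needed_size - size_gb)
    return math.ceil(missing / step_gb) * step_gb


if __name__ == "__main__":
    print(expansion_gb(size_gb=100, used_gb=95))
    print(expansion_gb(size_gb=100, used_gb=50))
```

Expanding to a target below the trigger threshold avoids firing the same action again immediately after the expansion.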

Statistical reports are displayed in Grafana. They include the number of CPU usage alarms, the number of memory usage alarms, the number of SWAP space usage alarms, the TOP 100 alarm ranking, and so on. Most of the report data is retrieved from the database with SQL statements. There are also operating system statistics, software version statistics, and more; as long as the SQL statement can be written, the data can be found.

To write the SQL you want, you can learn the Zabbix database table structure from the frontend's user debug mode, the API reference in the official documentation, the Zabbix database creation statements, and the Zabbix source code.
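
To make this concrete, the sketch below runs a report-style query against an in-memory SQLite database that mimics a heavily reduced subset of the Zabbix `hosts`, `items`, and `trends` tables. The real schema has many more columns and runs on a different database engine, so treat the query as a starting point; the sample rows are invented.

```python
# Sketch: a report query over a reduced copy of three Zabbix tables.
# SQLite is used only to keep the example self-contained; the sample
# rows are hypothetical.
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE hosts  (hostid INTEGER PRIMARY KEY, host TEXT);
CREATE TABLE items  (itemid INTEGER PRIMARY KEY, hostid INTEGER, key_ TEXT);
CREATE TABLE trends (itemid INTEGER, clock INTEGER, num INTEGER,
                     value_min REAL, value_avg REAL, value_max REAL);
"""

PEAK_SQL = """
SELECT h.host, MAX(t.value_max) AS peak
FROM trends t
JOIN items i ON i.itemid = t.itemid
JOIN hosts h ON h.hostid = i.hostid
WHERE i.key_ = ? AND t.clock >= ?
GROUP BY h.host
ORDER BY peak DESC
LIMIT ?
"""


def top_peaks(conn: sqlite3.Connection, key: str, since: int,
              limit: int = 100) -> List[Tuple[str, float]]:
    """Return (host, peak value) pairs for one item key, highest first."""
    return conn.execute(PEAK_SQL, (key, since, limit)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO hosts VALUES (?, ?)",
                     [(1, "app01"), (2, "db01")])
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(10, 1, "vfs.fs.size[/,pused]"),
                      (20, 2, "vfs.fs.size[/,pused]")])
    conn.executemany("INSERT INTO trends VALUES (?, ?, ?, ?, ?, ?)",
                     [(10, 3600, 60, 40.0, 41.0, 42.0),
                      (10, 7200, 60, 42.0, 43.5, 45.0),
                      (20, 3600, 60, 80.0, 85.0, 91.0)])
    return conn


if __name__ == "__main__":
    for host, peak in top_peaks(demo_connection(), "vfs.fs.size[/,pused]", since=0):
        print(host, peak)
```

The same join pattern, from a value table through `items` to `hosts`, underlies most ranking reports of this kind.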

Practical experience sharing - monitoring optimization

1. Monitoring configuration optimization. Remove unnecessary monitoring items, triggers, graphs, and automatic discovery rules; set item collection intervals sensibly; use active and passive modes appropriately; set sensible storage periods for history and trend data; keep trigger rules as simple as possible; use numeric item types wherever possible; and use macros to configure and manage variables in templates. When port monitoring timed out, we optimized the configuration with Zabbix LLD and preprocessing functions;

2. Custom scripts. It is recommended to add a timeout parameter to the script or to add a nodata alarm, so that timeouts are discovered promptly.
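
A minimal sketch of the timeout recommendation follows, assuming the custom script wraps an external command. The marker string is a hypothetical convention: returning a recognizable value lets a trigger alert on it, rather than leaving the item silently without data.

```python
# Sketch: run a collection command with a hard timeout.
# TIMEOUT_MARKER is a hypothetical sentinel, not a Zabbix convention.
import subprocess
import sys
from typing import List

TIMEOUT_MARKER = "ZBX_SCRIPT_TIMEOUT"


def run_with_timeout(command: List[str], timeout_s: float) -> str:
    """Return the command's trimmed stdout, or TIMEOUT_MARKER on timeout."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return TIMEOUT_MARKER
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "print(42)"], timeout_s=3.0))
```

The script's own timeout should be shorter than the agent's configured `Timeout`, so that the script, not the agent, decides what is reported.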

Practical experience sharing - version optimization

1. Regularly upgrade Server and Proxy to newer minor versions within the release's life cycle. We hit the following fault in production: on version 5.0.4, data was occasionally lost. We investigated for a long time and asked official support for help, but the cause was never found. In the end we had no option but to upgrade the minor version, and the problem was resolved after the upgrade;

2. Pay attention to the features of each major version and evaluate whether an upgrade is required.

Standardization of monitoring includes standardized monitoring management, standardized naming, and deployment optimization. Deployment optimization covers the most ground, including a high-availability deployment model, the use of SSDs, and tuning of the database and its parameters.

Zabbix practical experience sharing - daily operation and maintenance

1. Zabbix performance. Watch the number of items in the queue that have been delayed for more than 10 minutes, and monitor the Zabbix Server and Proxy logs and their own performance. The official Zabbix templates can be used for this;

2. Ineffective monitoring in Zabbix. This includes unsupported monitoring items and automatic discovery rules, triggers in an unknown state, and disabled hosts, monitoring items, automatic discovery rules, and triggers;

3. Zabbix monitoring verification. Check naming conventions, clock synchronization, hosts that are not yet monitored, hosts that lack the monitoring template for a component, and so on;

4. Database. Pay close attention to the state of the database and watch for abnormalities in database logs, high availability, partitioned tables, backups, and performance;

5. Application panels. Use visual components such as dashboards and Grafana, and customize panels from the perspective of the application.
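
The check for unsupported items in point 2 can be automated through the Zabbix API. The sketch below only builds the JSON-RPC request body for `item.get` filtered on `state` 1 (not supported); it does not send it. Passing the token in the `auth` field matches older releases such as the 5.0 series mentioned earlier, while newer releases may expect an HTTP authorization header instead, so check the API documentation for your version. The token value is a placeholder.

```python
# Sketch: build a Zabbix API request that lists unsupported items.
# The request is only constructed and printed, not sent.
import json
from typing import Any, Dict


def unsupported_items_request(auth_token: str, request_id: int = 1) -> Dict[str, Any]:
    """Return a JSON-RPC body for item.get limited to unsupported items."""
    return {
        "jsonrpc": "2.0",
        "method": "item.get",
        "params": {
            "output": ["itemid", "name", "key_", "error"],
            "selectHosts": ["host"],
            "filter": {"state": 1},  # 1 = not supported
            "monitored": True,       # enabled items on monitored hosts only
        },
        "auth": auth_token,
        "id": request_id,
    }


if __name__ == "__main__":
    body = unsupported_items_request("PLACEHOLDER_TOKEN")
    print(json.dumps(body, indent=2))
```

Running such a query on a schedule and reporting the `error` field per host turns cleanup of ineffective monitoring into a routine task.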

A Look into the Future of Monitoring

Prospects for future monitoring: from operation and maintenance to business operation

1. Change the mindset. Thinking should be business-oriented and data-driven, so that monitoring better serves the business and monitoring itself is digitally transformed;

2. Platformization of operation and maintenance tools. Further improve and integrate the existing tool chains, standardize operation and maintenance data, turn the tools into a platform, and further strengthen self-service monitoring capabilities;

3. Observability monitoring system. Build an observability monitoring system based on metrics, logs, and call chains (traces);

4. Intelligent monitoring. Use big data technology and the massive volume of operation and maintenance monitoring data to achieve alarm root cause analysis and fault self-healing through machine learning and intelligent algorithms.

Origin blog.csdn.net/Zabbix_China/article/details/130594302