[Reprint] Why We Gave Up Zabbix and Adopted Prometheus


https://mp.weixin.qq.com/s/Ul5f4xcHvTShOF9smHHUjQ

 

Yan Xiaoyu, 51CTO Technology Stack
 

In 2017, our operations monitoring was built mainly on Zabbix, the mainstream solution at the time. Back then, the database side also relied on the monitoring service provided by the operations team.




Overall, Zabbix is very powerful and relatively simple to use: writing a script is basically all it takes to monitor a given application.


PS: We no longer use Zabbix. The operations team has since built a unified operations monitoring system through custom development on Open-Falcon, but that is a story for another time.


We began trying out containerized MySQL in 2016 and rolled it out at scale in 2017. As the number of containerized MySQL instances grew substantially, we managed them through a platform.


Our original Zabbix setup could no longer meet demand:

  • There were too many monitoring metrics. If all of them were fed into Zabbix, the server could not cope (given the server resources available at the time).

  • The database operations platform needed alert handling to be linked with monitoring.

  • When instances are added or removed, the monitoring system needs to discover them automatically through the database operations platform.

  • We needed trend forecasting and fast statistics over the data.

  • We needed custom graphs of monitoring metrics.


So the DB team decided to select a monitoring system of its own to support these requirements.

 

Technology Selection

 

We did not want to put too much manpower into maintaining database monitoring. After all, a monitoring system is not our main job.

So we needed a monitoring system that is simple to deploy, consumes few server resources, and also includes alerting.

There are many open source monitoring systems, but after evaluating them we chose the more lightweight Prometheus, which let us meet our database monitoring needs quickly.

 

① Ease of use
Prometheus starts from a single binary, with a lightweight server that is easy to migrate and maintain. PromQL offers rich computation functions and a wide range of statistical dimensions.

② High performance
Monitoring data is mostly aggregated along the time dimension, so a time series database is a better fit for storing it, and indexing by time performs better. Prometheus handles millions of monitoring metrics and processes hundreds of thousands of data points per second.

③ Scalability
Prometheus supports federation, which lets multiple Prometheus instances form one logical cluster. When a single Prometheus Server has too many tasks to handle, it can be scaled out through functional sharding plus federation.

④ Ease of integration
The Prometheus community provides a large number of third-party exporters for collecting monitoring data: JMX, EC2, MySQL, PostgreSQL, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Memcached, RabbitMQ, Redis, Rsyslog, and so on.

⑤ Visualization
Prometheus comes with its own UI, through which data can be queried directly. Combined with Grafana, it is easy to build polished, detailed trend dashboards.

⑥ Powerful aggregation syntax
The built-in query language, PromQL, provides functions for querying and aggregating data. Alert rules can also be defined quickly on top of PromQL.

 

Practice


Monitoring purposes

 

Before building a monitoring system, we need to define the purpose of monitoring.

While summarizing this material, we came across a summary of the purposes of monitoring written by Zheng Yunlong, a CNCF Certified Kubernetes Administrator, based on the Google SRE book ("SRE: Google 运维解密"). It is well put, so we quote it directly.

 

Source reference:

https://www.bookstack.cn/read/prometheus-book/AUTHOR.md

 

  • Long-term trend analysis: by continuously collecting and aggregating monitoring samples, we can analyze the long-term trend of a metric. For example, from the growth rate of disk usage we can predict in advance when resources will need to be expanded.

  • Alerting: when a system fails or is about to fail, the monitoring system needs to react quickly and notify the administrator, so that the problem can be handled quickly or prevented early and business impact is avoided.

  • Fault analysis and localization: when a problem occurs, it needs to be investigated and dealt with. Analyzing the different monitoring metrics and historical data helps find and resolve the root cause.

  • Data visualization: visual dashboards give direct insight into the system's running state, resource usage, and service status.

A monitoring system that meets these points covers collection, analysis, alerting, and graphical display, which is the full set of functions a monitoring system should have. Below we explain how we built our database monitoring system on Prometheus.

Our Monitoring System Architecture Overview

 

We started at the end of 2016, and the architecture has gone through several iterations since then. For the sake of readability we will not go into the designs that were replaced, and will focus on the current architecture and how it is used.

First, the overall architecture:

We go through the components of the architecture diagram above one by one:

 

① Agent
This is our monitoring collection agent, developed in Golang. It is responsible for collecting monitoring metrics and instance logs.

The monitoring metrics include information about the host and about the container instances on it. Because we deploy in containers, a single host runs roughly 4 to 20 instances.

As the operations platform adds and removes instances, the instance information on a host changes. The Agent therefore needs to detect these changes in order to decide what to collect.

In addition, the collection interval is 10s, which gives finer-grained monitoring and avoids missing sudden spikes in metrics.

 

② Pushgateway
This is an official component that we use as provided. We need it because Prometheus obtains data by pulling.

If Prometheus Server pulled data from every node directly, the monitoring service would come under enormous pressure.

We monitor thousands of instances at a 10s collection interval. (A federated cluster would also work, but it would require deploying more Prometheus Servers, and together with the alerting components the whole architecture would become comparatively complex.)

So the Agent pushes data to Pushgateway, and Prometheus Server then pulls the data from Pushgateway.
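The push step can be sketched as follows. This is a minimal illustration in Python rather than the authors' Golang Agent: it renders metrics in the Prometheus text exposition format and builds a PUT request against Pushgateway's `/metrics/job/<job>/instance/<instance>` path. The gateway address, job name, instance name, and metric names are all hypothetical.

```python
import urllib.request


def format_metrics(metrics):
    """Render {metric_name: value} in the Prometheus text exposition format."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


def build_push_request(gateway, job, instance, metrics):
    """Build (without sending) a PUT request for one job/instance group."""
    url = f"{gateway}/metrics/job/{job}/instance/{instance}"
    body = format_metrics(metrics).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


# Hypothetical gateway address, job name, instance name, and metric names.
request = build_push_request(
    "http://pushgateway.example:9091",
    "mysql_agent",
    "db-host-01-3306",
    {"mysql_threads_running": 12, "mysql_qps": 3400},
)
# Sending is left to the caller, for example:
# urllib.request.urlopen(request, timeout=5)
```

A PUT replaces all metrics in the given job/instance group, so each 10s push overwrites the previous one and Prometheus Server scrapes the latest values from Pushgateway.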

 

This way, as long as Prometheus Server's write performance is sufficient, a single machine can carry the data of the whole monitoring system.

To handle collecting monitoring data across data centers, we can deploy a Pushgateway node in each data center, which also eases the pressure on any single Pushgateway.

③ Prometheus Server
Prometheus Server pulls data from the Pushgateway layer at an interval of 10s. With multiple Pushgateways, they can be configured as multiple groups.

For high availability, a second Prometheus Server can be set up in a disaster recovery data center with the same configuration as the first.

If monitoring data needs to be kept for a long time, another Prometheus Server can be configured with a larger collection interval, such as 5 minutes, and a data retention of 1 year.

④ Alertmanager
Before using Alertmanager, the alert rules need to be defined on Prometheus Server. Because our monitoring system is used by DBAs, the alert metric types can be managed in a unified way.

However, alert thresholds may need to differ between clusters or instances. How we make this flexibly configurable is covered later.

To remove Alertmanager as a single point of failure (see the high availability figure below), we can run it on three nodes. Alertmanager uses a Gossip mechanism for this.

The Gossip mechanism passes information between multiple Alertmanager instances. It ensures that when several Alertmanager instances receive the same alert, only one notification is sent to the receiver.

Alertmanager supports several kinds of configuration: custom templates, for example for sending HTML mail; alert routing, which decides how an alert is handled by matching its labels; receivers, which support email, WeChat, webhook, and other notification types; and inhibit_rules, which when configured sensibly reduce useless alerts (for example, when a machine goes down, the alerts from all instances on it can be ignored, preventing an alert storm).

We send alerts by webhook: a triggered alert is pushed to a specified API and then processed further by the service behind that interface.

High availability for Prometheus and Alertmanager

 

⑤ Filter & Rewrite module

This module implements alert rule filtering and alert content rewriting for MySQL clusters.

First, alert rule filtering. As mentioned above, there is one unified set of alert rules, so it would be very troublesome if a DBA needed to adjust the alert threshold for an individual cluster. To solve this, we built a Filter module behind Alertmanager.

This module receives the alert content and checks whether the alert exceeds the threshold range that the DBA has set for the cluster or instance (instance settings take priority over cluster settings). If it does, the send operation is triggered.
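The threshold lookup described above can be sketched like this. It is an illustration only, not the authors' module: the dictionary layout, the function names, and the sample cluster and instance names are hypothetical. The one rule taken from the text is that an instance-level threshold overrides a cluster-level one, which in turn overrides the unified default.

```python
def effective_threshold(metric, cluster, instance,
                        instance_rules, cluster_rules, default_rules):
    """Return the threshold for a metric: instance, then cluster, then default."""
    for rules, key in ((instance_rules, instance), (cluster_rules, cluster)):
        value = rules.get(key, {}).get(metric)
        if value is not None:
            return value
    return default_rules[metric]


def should_send(alert, instance_rules, cluster_rules, default_rules):
    """Decide whether an alert received from Alertmanager should be sent on."""
    threshold = effective_threshold(
        alert["metric"], alert["cluster"], alert["instance"],
        instance_rules, cluster_rules, default_rules,
    )
    return alert["value"] > threshold


# Hypothetical rule sets maintained by DBAs on the operations platform.
default_rules = {"thread_running": 30}
cluster_rules = {"orders": {"thread_running": 50}}
instance_rules = {"orders-01": {"thread_running": 80}}

alert = {"metric": "thread_running", "cluster": "orders",
         "instance": "orders-02", "value": 60}
send = should_send(alert, instance_rules, cluster_rules, default_rules)
```

Here `orders-02` has no instance-level rule, so the cluster threshold of 50 applies and a value of 60 is sent on. The same value on `orders-01` would be filtered out by its instance threshold of 80.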

 

Alerts are sent at different levels through different channels. For example, we define three levels, P0, P1, and P2, in descending order of priority:
  • P0: sent whenever triggered, at any time of day, by both phone call and WeChat.

  • P1: between 8:00 and 23:00, sent by WeChat only; at other times, sent only after triggering three times in a row.

  • P2: between 8:00 and 23:00, sent by WeChat; at other times, not sent.
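A minimal sketch of this level-based dispatch follows. The function and channel names are hypothetical. Two details are assumptions not stated in the text: the daytime window is treated as 8:00 inclusive to 23:00 exclusive, and a P1 alert that fires three times in a row at night is assumed to go out by WeChat.

```python
from datetime import time

DAY_START = time(8, 0)
DAY_END = time(23, 0)


def channels_for(level, now, consecutive_triggers=1):
    """Return the notification channels for an alert, or [] to suppress it."""
    daytime = DAY_START <= now < DAY_END  # assumption: 23:00 itself counts as night
    if level == "P0":
        return ["phone", "wechat"]
    if level == "P1":
        if daytime or consecutive_triggers >= 3:
            return ["wechat"]  # assumption: night-time P1 also uses WeChat
        return []
    if level == "P2":
        return ["wechat"] if daytime else []
    raise ValueError(f"unknown alert level: {level}")


night = time(2, 30)
p1_first = channels_for("P1", night, consecutive_triggers=1)
p1_third = channels_for("P1", night, consecutive_triggers=3)
```

Keeping the decision in one pure function makes the policy easy to test and to change, for example if a team later wants a different night-time window.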


The figure shows the alert threshold management page for clusters and instances (a feature integrated into the database operations platform). Each cluster and instance can be managed independently, and when a cluster is created, a default set of alert thresholds is chosen based on its CPU and memory configuration.

Cluster alert rule management

Instance alert rule management

Alert rule management

Next, alert content rewriting. As the figure above shows, additional recipients can be added: besides DBAs, some developers also want to receive alerts.

But if we send them an alert saying that Thread_running is above some value, they may not understand what it means, under what circumstances such an alert occurs, or what they need to pay attention to.

So we rewrite the alert content so that developers can read and understand it. The figure below shows the rewritten content.

Rewritten alert content

There is also alert correlation. For example, when a host has high disk IO, we may need to find out which instance is causing it. We take the alert, analyze the monitoring data for instances that may be causing the high IO, and attach them to the alert.

 

As the figure shows:

Instance information associated with an IO alert

Finally, alert convergence. When a host goes down, every MySQL instance on it also triggers a down alert (an instance is judged abnormal when its metrics have reported no data for three consecutive reporting periods). A flood of alerts drowns out the important ones, so some convergence is necessary.

What we do is attach the related instance information to the host-down alert, so a single alert shows everything and tells us which clusters and instances are affected.
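The convergence idea can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: the alert dictionaries, the `type` values, and the function name are hypothetical. Instance-down alerts on a host that is itself down are folded into one host-down alert listing the affected clusters and instances, while all other alerts pass through unchanged.

```python
from collections import defaultdict


def converge(instance_alerts, down_hosts):
    """Fold instance-down alerts on down hosts into one alert per host."""
    affected = defaultdict(list)
    passthrough = []
    for alert in instance_alerts:
        if alert["type"] == "instance_down" and alert["host"] in down_hosts:
            affected[alert["host"]].append((alert["cluster"], alert["instance"]))
        else:
            passthrough.append(alert)
    merged = [
        {"type": "host_down", "host": host, "affected": sorted(affected[host])}
        for host in sorted(down_hosts)
    ]
    return merged + passthrough


# Hypothetical alerts: two instances on a down host, one unrelated alert.
alerts = [
    {"type": "instance_down", "host": "h1", "cluster": "orders", "instance": "orders-01"},
    {"type": "instance_down", "host": "h1", "cluster": "users", "instance": "users-03"},
    {"type": "thread_running", "host": "h2", "cluster": "pay", "instance": "pay-01"},
]
result = converge(alerts, down_hosts={"h1"})
```

With this input, three incoming alerts become two outgoing ones: a single host-down alert for `h1` carrying both affected instances, plus the unrelated alert on `h2`.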

 

As the figure shows:

Instance information associated with a host-down alert

 

⑥ Graph

Prometheus works very well with Grafana. By combining Grafana with PromQL, we can build monitoring graphs quickly.

To link with the operations platform, we pass parameters in the URL, so that the monitoring graphs for a specific cluster or instance can be opened directly from the operations platform.
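One way to build such a link is sketched below. Grafana reads dashboard template variables from `var-<name>` query parameters. The base URL, dashboard UID, slug, and variable names here are hypothetical, and the article does not say which parameters the authors actually pass.

```python
from urllib.parse import urlencode


def dashboard_url(base, uid, slug, **variables):
    """Build a Grafana dashboard URL with template variables preselected."""
    query = urlencode({f"var-{name}": value for name, value in variables.items()})
    return f"{base}/d/{uid}/{slug}?{query}"


# Hypothetical dashboard and variable names.
url = dashboard_url(
    "http://grafana.example:3000", "abc123", "mysql-instance",
    cluster="orders", instance="10.0.0.1:3306",
)
```

The operations platform only needs to know the dashboard identifier and the variable names. The dashboard itself stays generic and is filtered by the values in the URL.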

Instance monitoring graph

Cluster monitoring graph

 

⑦ V-DBA

This is an automation program for DBAs that can carry out certain actions in response to alerts. Take overload protection as an example: our overload protection is not always on, and the overload-related actions run only after a thread_running alert has been triggered.

The detailed scheme is shown below:

⑧ Alert management

On the operations platform we have a dedicated page for managing alerts, which is also adapted for mobile, so that DBAs can connect to the platform at any time to view and handle alerts.

The figure shows the list of currently triggered alerts. An alert with no color has been acknowledged (an acknowledged alert is under maintenance and is not sent again), while a colored alert has not been acknowledged (the alerts in this figure are P2 level).

Note also that the alert content here is not rewritten, because it is meant for DBAs.

PC view

Mobile view

Based on the alert log, we use ES and Kibana for interactive data analysis and display, which helps DBAs complete routine inspections of large-scale database operations easily, quickly locate problematic clusters, and optimize them in time.

Alert analysis

Other Practices Based on Prometheus

On top of the Prometheus-based solution, we have also built other extensions related to monitoring and alerting.

 

① Cluster score
Because alerts are graded, most alerts are P2, which are sent only during the day in order to reduce the number of alerts at night.

But this means some alerts may be missed and problems may not be exposed in time, so we built a cluster score feature to analyze cluster health.

It also shows the score trend over the past month, so that a DBA can quickly judge whether a cluster needs optimization.

As shown below:

Cluster scores

Clicking through opens the cluster details page, which shows CPU, memory, and disk usage (here disk usage reaches 262%, meaning the quota has been exceeded).

It also shows the QPS, TPS, and Thread_running curves for yesterday and for 7 days ago, which are used to observe changes in the cluster's request volume. The notes also mark which items lost points and which instances they belong to.

Details page

 

② Metric forecasting
We forecast which hosts will have less than 200G of disk space within 7 days. Because hosts run multiple instances, this requires computing the current data size, the log size, and the daily growth trend of each instance on the host.

This lets a DBA quickly identify the nodes whose instances need to be migrated or expanded. It is implemented with Prometheus's predict_linear function (see the official documentation for usage details).
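To illustrate the idea behind predict_linear, the sketch below fits a least-squares line to a series of (timestamp, value) samples and extrapolates it. It is a plain Python approximation rather than Prometheus code: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time. The sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line through (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


DAY = 86400
# Invented data: free disk space in GB, falling by 10 GB per day.
free_gb = [(0 * DAY, 260.0), (1 * DAY, 250.0), (2 * DAY, 240.0), (3 * DAY, 230.0)]

forecast = predict_linear(free_gb, 7 * DAY)
needs_attention = forecast < 200  # the 200G / 7-day rule from the text
```

With a steady loss of 10 GB per day from 230 GB, the fitted line reaches about 160 GB after 7 more days, so this host would be flagged.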

Disk Space Warning

 

Log-related

 

① SlowLog
For SlowLog management, we collect and analyze slow logs through our own system. Because we need the raw logs, we did not use pt-query-digest.

The architecture is as follows:

The Agent collects and formats the raw logs and writes them to Redis with LPUSH (the data volume is small, so we did not use an MQ or Kafka). A program called slow_log_reader then reads them with BLPOP, processes them, and writes them to ES.

The processing step mainly extracts SQL fingerprints and rewrites sharded database names to the logical database name.
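The two processing steps might look like the sketch below. It is a simplified, regex-based illustration, not the logic of slow_log_reader: real fingerprinting handles many more cases, such as comments and double-quoted strings. The convention that shard names end in `_<digits>` is an assumption made for the example.

```python
import re


def fingerprint(sql):
    """Normalize a statement so that queries differing only in literals match."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)       # single-quoted strings
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)       # standalone numbers
    sql = re.sub(r"\s+", " ", sql).strip().lower()     # whitespace and case
    sql = re.sub(r"\bin\s*\(\s*\?(?:\s*,\s*\?)*\s*\)", "in (?+)", sql)  # IN lists
    return sql


def logical_db_name(shard_name):
    """Map a shard such as orders_0012 to its logical name (assumed suffix rule)."""
    return re.sub(r"_\d+$", "", shard_name)


fp = fingerprint("SELECT * FROM orders  WHERE id IN (1, 2, 3) AND name = 'bob'")
db = logical_db_name("orders_0012")
```

Grouping by the pair (logical database, fingerprint) lets ES aggregate the same slow query across all shards instead of counting each shard and each literal value separately.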

Once the data is in ES, it can be queried with Kibana.

This is integrated into the one-stop database platform for developers, where business developers can look up the databases they have permissions on. We also integrated SOAR, open sourced by Xiaomi, which can be used to find ways to optimize SQL.

Through ES aggregation, users can subscribe to slow query reports and selectively view the TOP slow SQL for the databases they care about, so that optimization can be targeted.

② Processlist and InnoDBStatus data collection
To be able to look at session snapshots and InnoDBStatus from the time of a fault when reviewing it, we built this feature into the monitoring Agent. It also runs every 10 seconds. The difference is that it first checks whether the current ThreadRunning has reached the threshold, and collects data only if it has.

This both avoids producing too many useless logs and ensures that status information is available when performance is abnormal.

The subsequent log collection and processing logic lives in the same program as slow query processing, and works much like slow query processing, so we will not repeat it here.

Summary

No monitoring system is absolutely better or worse than another. What matters most is that it suits your own team and solves the problem sensibly at minimal cost.

We have gone from version 1.x, which we started using in 2016, to the 2.x version now in production. Our Prometheus-based monitoring system now carries monitoring for all instances, hosts, and the entire container platform.

The collection interval is 10 seconds, and the 1-minute average ingestion rate in Prometheus is 90,000 to 100,000 samples per second.

A single physical machine (not counting the high availability and disaster recovery resources) carries the current load with plenty of headroom in CPU, memory, and disk. If a single machine can no longer cope in the future, it can be scaled out into a federated cluster.

In addition, the monitoring system described in this article is just one module of our operations platform, not a standalone system. In our experience, it is best integrated into the operations platform, which converges the technology stack and turns the system into a product and a platform, reducing complexity of use.

Finally, what we want to do with monitoring in the future. We now have a lot of monitoring data, but alerts only carry the metric content, and the specific root cause still has to be worked out by a DBA from the monitoring information.

We plan to implement correlation analysis of alert metrics. In the first stage, it should give a conclusion drawn from a more comprehensive set of monitoring metrics, to help DBAs locate problems quickly. In the second stage, it should give handling recommendations based on the analysis results.

The ultimate goal is to rely on the monitoring system as a whole to reduce operational complexity, break down the communication barriers between operations and business development, and improve operational efficiency and service quality.

Author: Yan Xiaoyu

Profile: Database technical expert at Tongcheng-Elong, with many years of DB operations experience in the internet industry, having worked as a DBA in the gaming, O2O, and e-commerce sectors. He joined Tongcheng-Elong in 2016 and is currently responsible within the team for database architecture design and optimization, operations automation, building the MySQL monitoring system, and the design and development of the DB private cloud platform.

Editor: Tao Jialong, Sun Shujuan

Source: Reprinted from the WeChat public account DBAplus community (ID: dbaplus)


Origin www.cnblogs.com/jinanxiaolaohu/p/12155915.html