Prometheus vs InfluxDB

Foreword

In addition to traditional monitoring systems such as Nagios, Zabbix, and Sensu, monitoring systems built on time series databases, such as Prometheus and InfluxDB, are becoming more and more popular with the rise of microservices. gtt has tried both, hoping to understand the differences between them and to provide some help for future technology selection.

First of all, when it comes to time series databases, we have to mention the venerable RRDtool and Graphite. These classic old systems work very well, except that some people dislike them for not scaling to very large deployments, and others find them inconvenient to deploy. Hence the rising stars: OpenTSDB, Prometheus, InfluxDB, and so on.

Monitoring Systems

OpenTSDB

OpenTSDB: a time series database built on Hadoop and HBase. It pioneered the approach of attaching tags (key-value pairs) to metrics, enabling a more convenient and powerful query syntax; the design and query syntax of InfluxDB are heavily inspired by it. Building on Hadoop and HBase gives OpenTSDB exceptional horizontal scalability, but those same two dependencies mean that, for teams unfamiliar with Hadoop, the maintenance cost of OpenTSDB is high. Hence, someone came up with InfluxDB.
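To illustrate the tag idea, a data point written through OpenTSDB's telnet-style put interface looks roughly like this (the metric name and tag values here are made up for illustration):

put sys.cpu.user 1475216224 42.5 host=web01 cpu=0

The host and cpu tags can then be filtered or aggregated over at query time, which is essentially the data model that InfluxDB and Prometheus later adopted.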

InfluxDB

InfluxDB: a time series database implemented in Go by InfluxData. One of InfluxDB's slogans is that, from the ground up, it has no external dependencies whatsoever: it is a single executable file that can simply be dropped onto a server and run, which is very friendly to operations. Its query syntax is heavily inspired by OpenTSDB. Although the project initially advertised a built-in cluster feature for easy horizontal scaling, clustering was removed after InfluxDB 1.0 and replaced by the Relay mode for high availability. The official documentation reflects this change, although the cluster usage instructions for version 0.9 can still be found on the official website; they will most likely be removed in the future.

Note: Clustering is now a commercial product called InfluxEnterprise. More information can be found here.

Prometheus

Prometheus: a monitoring system open-sourced by SoundCloud and since handed over to the open source community to run independently. Like k8s, it is a member project of the Cloud Native Computing Foundation, although at the time of writing the Foundation has only these two projects, k8s and Prometheus. The biggest difference from the two systems above is that they are just databases, while Prometheus is a monitoring system: it includes not only a time series database but also scraping, querying, graphing, and alerting. The official documentation describes this difference in detail.

Prometheus is largely inspired by Google's internal Borgmon system, a monitoring system built on the pull model. The book "Site Reliability Engineering" contains this sentence mentioning Prometheus; of course, I won't tell you that the original text also mentions the monitoring systems Bosun and Riemann, which is why the sentence ends with an ellipsis:

Even though Borgmon remains internal to Google, the idea of treating time-series data as a data source for generating alerts is now accessible to everyone through those open source tools like Prometheus […]
— Site Reliability Engineering: How Google Runs Production Systems (O'Reilly Media)

That said, InfluxData has also launched a complete solution built around its time series database: the TICK stack, which covers data collection (Telegraf), storage and query (InfluxDB), visualization (Chronograf), and alerting (Kapacitor). This approach is very similar to Elastic's: around the core capability of Elasticsearch, Logstash and Kibana were brought in, and peripheral components such as Beats and Watcher were developed, forming a fully featured full-text search solution.

We have digressed a bit; back to the core question of this article: what is the difference between InfluxDB and Prometheus? The main difference at present is that the former is just a database, passively accepting inserts and queries from clients, while the latter is a complete monitoring system that can scrape data, query it, alert on it, and more.

Push vs Pull

At this point, we know that Prometheus is pull-based and InfluxDB is push-based. Regarding push versus pull, I have previously written a comparison of Ansible and Puppet, but in monitoring systems there are subtle differences.

First of all, push and pull describe only the direction of data transmission; they do not affect the content transmitted. In other words, any information that push can carry, pull can carry equally well. Take a metric such as "CPU utilization 30%": whether it is pulled or pushed, the content transmitted is the same and does not change with the transmission mode. Consequently, the network bandwidth consumed by the two approaches will not differ much.

gtt believes that the main differences between push and pull are:

Different Initiators

The initiator in pull mode is the monitoring system, which polls the monitored targets in turn. If a target sits behind a firewall or NAT, pull cannot reach it. Moreover, for batch jobs, the entire run may finish in less time than the polling interval, so the monitoring system would never capture their data.

To solve these two problems, Prometheus provides the Pushgateway component to support push-style monitoring requirements.
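As a minimal sketch of how a batch job might hand its result to a Pushgateway using the official Go client (the Pushgateway address, job name, and metric name here are all illustrative assumptions):

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Record when this batch job last completed.
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "batch_job_last_completion_timestamp_seconds",
		Help: "Unix timestamp of the last successful batch job run.",
	})
	completionTime.SetToCurrentTime()

	// Push once when the job finishes; the Pushgateway holds the value
	// until Prometheus scrapes it on its normal pull schedule.
	if err := push.New("http://localhost:9091", "demo_batch_job").
		Collector(completionTime).
		Push(); err != nil {
		panic(err)
	}
}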

Push, by contrast, is initiated by the monitored target itself, so it can get through firewall restrictions: even a target hiding behind NAT can still push its data out, and batch jobs can report their data at their own pace.

In addition, some people claim that "a push-mode monitoring system is a single point, with a single point of failure and a performance bottleneck, while pull mode has neither."

Here gtt disagrees. The way to fix the single point of failure in pull mode is to add a second monitoring system, which is essentially improving reliability through data redundancy. So why can't push push to two monitoring systems? That achieves the same data redundancy.

As for performance bottlenecks, this holds even less. Whether push or pull, only the transmission mode differs; the content is the same and the bandwidth consumed is the same. The only real difference is the degree of concurrency. In push mode, target services may all push data to the monitoring system within a narrow window, producing a burst of concurrent requests that resembles a DDoS attack. Pull mode, by contrast, polls the target services and can ingest monitoring data at whatever concurrency it can afford, avoiding short bursts of monitoring traffic. But there is a remedy: in push mode, put a request queue in front of the monitoring service and buffer requests that exceed its capacity, so that the monitoring system can process data at its own pace and its own clients cannot accidentally DDoS it.
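A toy sketch of that queue idea (purely illustrative, not any real InfluxDB or Prometheus code): a bounded buffer absorbs the burst, and a writer drains it at its own pace.

package main

import (
	"fmt"
	"time"
)

// sample is one pushed data point.
type sample struct {
	metric string
	value  float64
}

func main() {
	// A bounded queue: bursts of pushes land here instead of
	// hitting the storage layer directly.
	queue := make(chan sample, 1000)

	// The writer drains the queue at the pace storage can afford.
	go func() {
		for s := range queue {
			time.Sleep(time.Millisecond) // stand-in for the real write
			fmt.Printf("stored %s=%g\n", s.metric, s.value)
		}
	}()

	// A burst of pushes, as if many targets reported at once.
	for i := 0; i < 100; i++ {
		select {
		case queue <- sample{"disk_io_time", float64(i)}:
		default:
			// Queue full: shed load instead of falling over.
		}
	}
	time.Sleep(time.Second) // let the writer catch up before exiting
}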

Therefore, single points of failure and performance are not essential differences between the two modes.

Different Logical Architectures

Push requires the monitored target to know the address (IP or domain name) of the monitoring system, so this information must be configured into the target service; in other words, the target service depends on the monitoring system. If the monitoring system's address changes, every target service must be updated accordingly. And once such a dependency exists, a failure of the monitoring system may affect the normal operation of the target service. Some of this can be mitigated in code, but logically the target service still depends on the monitoring system. The architecture is shown in the following figure:

(Figure: push mode, where target services depend on the monitoring system)
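Concretely, in the TICK stack this dependency shows up in each monitored host's Telegraf configuration, which must name the monitoring side's address (the host name and database below are placeholder assumptions):

[[outputs.influxdb]]
  urls = ["http://monitor.example:8086"]
  database = "telegraf"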

Pull requires the monitoring system to know the addresses of all target services, while the target services remain unaware of the monitoring system. The monitoring system therefore depends on the target services, and each time a target service is added or removed, the monitoring system's configuration is modified. From this point of view, pull mode fits the logical architecture better. To automate the addition and removal of targets, Prometheus supports dynamically obtaining target addresses from a service discovery system, eliminating complex configuration when deploying microservices at scale. The logical architecture is shown in the following figure:

(Figure: pull mode, where the monitoring system depends on the target services)
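As an illustration of both styles of target configuration, a fragment of prometheus.yml might look like the following (the addresses and the Consul server are placeholder assumptions):

scrape_configs:
  # Statically configured targets: the monitoring system knows them.
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.1:9100", "10.0.0.2:9100"]
  # Targets obtained dynamically from a service discovery system.
  - job_name: "services"
    consul_sd_configs:
      - server: "localhost:8500"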

The sharp-eyed reader will have noticed: doesn't the service discovery system then depend on all the target services? Yes, but the service discovery system and the monitoring system occupy different positions in the logical architecture. In architectures built around service discovery, a target service that has not been "discovered" cannot actually serve traffic at all, so it genuinely must depend on the service discovery system. By contrast, target services can still function perfectly well without a monitoring system. The logical architecture is shown in the following figure:

(Figure: pull mode with a service discovery system)

Because targets do not depend on the monitoring system, it is easy to check manually whether a target service is healthy even before any monitoring is deployed: simply mimic the monitoring system and hit one of the target's interfaces. Pull-mode monitoring is therefore more white-box, and you can easily get at all the information. Push mode, on the other hand, relies on a fully formed monitoring service; without it, there is no way to know how the target service is doing.
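For example, for a Prometheus-style target you can read its metrics endpoint directly (assuming a node_exporter-style target; 9100 is its conventional port):

curl -s http://target:9100/metrics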

Query Syntax

At this point, we know that Prometheus obtains data via pull and InfluxDB receives data via push. Now let's focus on how the two differ in querying.

For example, suppose the stored disk IO time data looks like this:

timestamp    metric: value       tag
1475216224   disk_io_time: 10    type="sda"
1475216224   disk_io_time: 30    type="sdb"
1475216224   disk_io_time: 11    type="sdc"
1475216224   disk_io_time: 18    type="sde"

Basic Queries

As time series databases, both handle basic data retrieval very simply.

InfluxDB:

SELECT mean("value") FROM "disk_io_time" WHERE $timeFilter GROUP BY time($interval), "instance" fill(null)

Prometheus:

disk_io_time

Basic Arithmetic

There is not much difference between the two:

InfluxDB:

SELECT mean("value") *1024 FROM "disk_io_time" WHERE $timeFilter GROUP BY time($interval), "instance" fill(null)

Prometheus:

disk_io_time*1024

Calculating Rates

InfluxDB:

SELECT derivative(mean("value"), 10s) *1024 FROM "disk_io_time" WHERE $timeFilter GROUP BY time($interval), "instance" fill(null)

Prometheus:

rate(disk_io_time[5m])*1024

Calculations Across Dimensions

This is the biggest difference gtt has found so far between the two. Suppose, for example, we need to add up the IO time of sda and sdc. InfluxDB does not yet support such a query, although the community is already discussing implementations: [feature request] Mathematics across measurements #3552.

And Prometheus can accomplish this task:

rate(disk_io_time{type="sda"}) + rate(disk_io_time{type="sdc"})

Summary

Overall, Prometheus is a solid monitoring system: its design is deeply influenced by Google's internal Borgmon, and it has an elegant query syntax. InfluxDB is only a time series database with no other monitoring-related functionality, although InfluxData provides companion components to fill the gaps. Compared with Prometheus, its query syntax is more verbose, and it does not support calculations across dimensions.
