List of mainstream open source monitoring systems

Reducing faults has two meanings. One is routine prevention, so that faults do not happen in the first place; the other is stopping the loss quickly and shortening the outage when a fault does occur. The typical role of monitoring is to help us detect and locate faults, and both steps are crucial to shortening an outage.

Operations engineers and developers are the two groups that care most about stability, but with different emphasis. Generally speaking, operations engineers are responsible for running every business in the company, while developers are responsible only for their own business line. When a failure occurs, operations engineers want to find the root cause quickly and stop the loss in time. Developers want that too, but they also hope to "prove their innocence." Whatever the purpose, monitoring is an indispensable tool.

Business programs can also expose their monitoring data in several ways. The best-known instrumentation tools are StatsD and Prometheus. Some languages have easier-to-use instrumentation libraries of their own, such as Micrometer in the Java ecosystem. Beyond metrics, business programs usually have richer means of observation, such as distributed tracing frameworks: Zipkin, Jaeger, SkyWalking, and so on. Any software can also expose its health through logs, but this is the most expensive method. Log data is unstructured, which makes it suitable for troubleshooting but not as a source of metric data.
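To make the idea of instrumentation concrete, here is a minimal sketch of what a StatsD-style client produces. It formats metrics in the plain-text StatsD line format (`name:value|type`, with an optional `|@rate` sample rate). The metric names and the helper functions `statsd_line` and `send` are made up for illustration, and the host and port passed to `send` are assumptions about where a StatsD daemon might listen.

```python
import socket


def statsd_line(name, value, metric_type, sample_rate=1.0):
    """Format one metric in the StatsD plain-text line format."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # A sample rate tells the server the client only sent a fraction of events.
        line += f"|@{sample_rate}"
    return line


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the address is an assumption, not a requirement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical business metrics: a counter and a timer in milliseconds.
print(statsd_line("orders.created", 1, "c"))
print(statsd_line("orders.latency", 320, "ms"))
print(statsd_line("orders.created", 1, "c", sample_rate=0.1))
```

The business code only calls a one-line helper at the point of interest; aggregation and storage happen elsewhere, which is what keeps this style of instrumentation cheap.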

Metrics monitoring can only handle numbers, but historical data is cheap to store, it works well in real time, and its ecosystem is huge. It is the most important pillar of observability.

Another important pillar of observability is logging. Logs carry a great deal of information that is critical to understanding how the software and the business are running. Operating system logs, access-layer logs, and service logs, for example, are all important data sources.

The final pillar of observability is distributed tracing. With the popularity of microservices, what used to be a single application is split into many small services with intricate call relationships between them, so working out which module is causing a problem is genuinely difficult.

The idea of distributed tracing is to string the upstream and downstream modules together by request. A random string is generated as the request ID for each request, and when services call each other the ID is passed down layer by layer. How long each layer takes and whether it succeeds can be collected and attached to the request ID. When investigating a problem later, the whole chain of information can be pulled out by that request ID.
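The request-ID mechanism described above can be sketched in a few lines. This is a toy, in-process illustration, not how Zipkin, Jaeger, or SkyWalking are implemented: the three "services" are plain functions with hypothetical names, and `TRACE_STORE` is an in-memory stand-in for a tracing backend.

```python
import time
import uuid
from collections import defaultdict

# In-memory stand-in for a trace backend: request ID -> list of span records.
TRACE_STORE = defaultdict(list)


def record_span(request_id, service, started, ok):
    """Attach one layer's duration and status to the request ID."""
    TRACE_STORE[request_id].append({
        "service": service,
        "duration_ms": (time.perf_counter() - started) * 1000,
        "ok": ok,
    })


def inventory_service(request_id):
    started = time.perf_counter()
    record_span(request_id, "inventory", started, ok=True)


def order_service(request_id):
    started = time.perf_counter()
    inventory_service(request_id)  # the same ID is passed downstream
    record_span(request_id, "order", started, ok=True)


def gateway():
    request_id = uuid.uuid4().hex  # random string generated once, at the entry point
    started = time.perf_counter()
    order_service(request_id)
    record_span(request_id, "gateway", started, ok=True)
    return request_id


rid = gateway()
# Later, everything recorded along the call chain is retrieved by the request ID.
for span in TRACE_STORE[rid]:
    print(span["service"], span["ok"])
```

In a real system the ID travels in request metadata such as an HTTP header rather than as a function argument, but the lookup at the end works the same way.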

Zabbix is an enterprise-grade open source solution that is good at monitoring devices, networks, and middleware. Because monitoring systems in past years were used mainly for devices and middleware, Zabbix became widely used in China.

Advantages of Zabbix

  • It is compatible with a wide range of devices. The agent (agentd) runs not only on Windows and Linux but also on AIX.
  • The architecture is simple. Time-series data is stored in a relational database, which is easy to maintain and relatively easy to back up and dump.
  • The community is huge and there is a lot of material. Zabbix has been open source since the early 2000s, and because it has been developed for so long, a large number of resources can be found online.

Disadvantages of Zabbix

  • Storing data in a database does not scale horizontally, so capacity is limited. At a high collection frequency, such as once every 10 seconds, the ceiling is roughly 600 monitored devices, and even then the database has to run on a high-end machine with SSD or NVMe disks.
  • Zabbix's management logic is asset-oriented. The data structure of its metrics is relatively fixed and has no flexible label design, so it struggles with the dynamic, ever-changing environments of cloud-native architectures.

Open-Falcon built a distributed time-series storage component, Graph, on top of RRDtool. This approach lets multiple machines form a cluster, which greatly increases the capacity for handling massive amounts of data. The component in front of Graph, responsible for forwarding, is Transfer. Transfer derives a unique ID for each piece of monitoring data and hashes that ID to map the data to a Graph instance. This is the core sharding logic in the Open-Falcon architecture.
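The sharding idea can be illustrated with a small sketch. It is a simplification under stated assumptions: the instance names, the way the ID is built from endpoint, metric, and tags, and the plain hash-modulo placement are all chosen here for illustration, and the real Transfer may use a different scheme (for example consistent hashing, which moves less data when instances are added or removed).

```python
import hashlib

# Hypothetical Graph instance names; a real deployment would list host addresses.
GRAPH_INSTANCES = ["graph-0", "graph-1", "graph-2"]


def series_id(endpoint, metric, tags):
    """Build a stable ID for one series; tags are sorted so order does not matter."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{endpoint}/{metric}/{tag_str}"


def pick_graph(sid, instances=GRAPH_INSTANCES):
    """Map a series ID to one storage instance by hashing the ID."""
    digest = hashlib.sha256(sid.encode("utf-8")).hexdigest()
    return instances[int(digest, 16) % len(instances)]


sid = series_id("web-01", "cpu.idle", {"core": "0", "dc": "bj"})
print(sid, "->", pick_graph(sid))
```

Because the mapping depends only on the ID, every Transfer instance sends the same series to the same Graph instance without any coordination between them.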

Advantages of Open-Falcon

  • It can handle large-scale monitoring scenarios and has far more capacity than Zabbix. It covers not only devices and middleware but also application-level monitoring. It eventually replaced Xiaomi's internal perfcounter system and three separate Zabbix deployments.
  • The components are loosely coupled. Most are written in Go and the web part is in Python, which makes custom development on top of it easy.

Disadvantages of Open-Falcon

  • The ecosystem is not large enough, and the project is dominated by Xiaomi. Many companies have built their own customized versions without contributing back to the community. There are outside contributors, but relatively few.
  • The open source governance is weak. When Xiaomi's core developers left, the project stagnated, and Xiaomi did not invest in governance afterward. Compared with foundation-hosted projects, it lacks vitality.

Prometheus is practically made for Kubernetes. It supports Kubernetes directly, provides a variety of service discovery mechanisms, and greatly simplifies Kubernetes monitoring.

In a Kubernetes environment, pods are created and destroyed very frequently, so the life cycle of individual metrics becomes much shorter, and asset-oriented monitoring systems like Zabbix cannot keep up. Moreover, most cloud-native environments are built from microservices. As the number of services grows, the number of metrics explodes, which places very high demands on time-series storage.
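A small sketch shows why a label-based data model copes with short-lived pods better than an asset-oriented one. The series below are invented sample data, and `render` and `select` are hypothetical helpers; `render` prints each sample in the style of the Prometheus text exposition format (`name{label="value"} value`), and `select` mimics an equality label matcher.

```python
def render(name, labels, value):
    """Render one sample in the style of the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def select(series, selector):
    """Return the samples whose labels equal every key/value in the selector."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(k) == v for k, v in selector.items())
    ]


# Invented samples: pod names change on every rollout, the "app" label does not.
series = [
    ({"app": "checkout", "pod": "checkout-7d9f-abcde"}, 120.0),
    ({"app": "checkout", "pod": "checkout-7d9f-fghij"}, 80.0),
    ({"app": "cart", "pod": "cart-5c8b-klmno"}, 50.0),
]

for labels, value in series:
    print(render("http_requests_total", labels, value))

# The query names the service, not a specific machine or pod.
total = sum(value for _, value in select(series, {"app": "checkout"}))
print("checkout total:", total)
```

When a pod is replaced, a new series with a new `pod` label simply appears; queries and alert rules written against `app` keep working without anyone registering the new pod as an asset.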

Advantages of Prometheus

  • It supports Kubernetes very well. Prometheus is currently the standard choice for Kubernetes monitoring.
  • The ecosystem is huge. There are exporters of every kind, and a variety of time-series databases can serve as backend storage. There are also good SDKs in multiple languages for instrumenting business code.

Disadvantages of Prometheus

  • Usability is poor. For example, changing an alert rule means editing a configuration file, which is awkward to coordinate across teams. Companies that practice IaC well may actually prefer this, but in China today most companies have not gone that far. People still prefer a web interface for viewing monitoring data and managing alert rules.
  • Exporter quality is uneven. There is usually one exporter per kind of monitoring target, so the management cost is relatively high.
  • Capacity is limited. By default Prometheus provides only a single-node time-series database, and clustering depends on other time-series databases.

Nightingale can be regarded as a continuation of Open-Falcon, because the developers are the same group of people, but the two products are positioned completely differently. Prometheus has already become popular in Kubernetes environments and there is little point in reinventing the wheel, so Nightingale's approach is to integrate well with Prometheus and build a more complete solution on top of it. In the current architecture, Prometheus mainly serves as a time-series database and a data source for Nightingale. Prometheus is not required, though. Many companies use VictoriaMetrics as the time-series database instead.

Advantages of Nightingale

  • It has a fairly complete UI, access control, and product feature set, so it can serve as a company-wide unified monitoring product for all teams. Prometheus is usually run per team, which is convenient at that scale. If a whole company shares a single Prometheus deployment, it becomes awkward and runs into the coordination problems mentioned above, whereas Nightingale handles that coordination relatively well.
  • Its design is compatible and open. It works with collectors such as Categraf, Telegraf, Grafana-Agent, and Datadog-Agent, as well as the exporters of the Prometheus ecosystem. For storage, it works with Prometheus, VictoriaMetrics, M3DB, Thanos, and others.

Disadvantages of Nightingale

  • To cope with network partitions between data centers, the alert engine is split out as a separate module and deployed in each data center. Many small and medium-sized companies do not need such a complex architecture, and it makes deployment and maintenance more troublesome.
  • Alert delivery lacks aggregation, noise reduction, and convergence logic. The official explanation is that a separate event center product will be developed to handle alert events from multiple sources such as Nightingale, Zabbix, and Prometheus, but it has not been released yet.

Each solution has its own advantages and disadvantages. If your main requirement is monitoring devices, Zabbix is recommended. If your main requirement is monitoring Kubernetes, you can choose Prometheus plus Grafana. If you need a company-wide solution that covers both traditional device and middleware monitoring and Kubernetes, Nightingale is recommended.

This article is a study note for Day 27 of July. The content comes from the Geek Time course "Operation and Maintenance Monitoring System Practical Notes", which I recommend.

Origin blog.csdn.net/key_3_feng/article/details/131969319