Monitoring system-prometheus


1. Introduction

1.1 Official introduction

Prometheus is an open-source system monitoring and alerting toolkit originally built at SoundCloud in 2012 by former Google engineers. Since then, many companies and organizations have adopted Prometheus as their monitoring and alerting tool. The Prometheus developer and user community is very active, and it is now a standalone open source project maintained independently of any company. To underline this, Prometheus joined the CNCF in May 2016, becoming the second project hosted by the CNCF after Kubernetes.

Brief history

The chart is somewhat dated; Prometheus has since been updated to version 2.25.

1.2 Main functions/advantages

  • Multi-dimensional data model (a time series is identified by a metric name and a set of k/v labels)
  • Powerful query language (PromQL)
    For example, the following query returns the average user-mode CPU usage over the last 30s for machines whose instance matches 10.224.192.\d{3}:9100:
    avg(rate(node_cpu_seconds_total{instance=~"10.224.192.\\d{3}:9100", mode="user"}[30s]))*100
  • No dependence on distributed storage; single server nodes are autonomous. Both local and remote storage models are supported
  • Data is pulled over HTTP, which is simple and easy to understand. An intermediate gateway can also be used to push data
  • Monitoring targets can be configured via service discovery or static configuration (see the scrape configuration sketch after this list)
  • Support for multiple modes of graphing and dashboarding
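As a rough illustration of the last two points, a scrape configuration can mix statically listed targets with file-based service discovery. This is a minimal sketch; the target addresses and the file path are placeholders, not taken from the original article:

    scrape_configs:
      - job_name: 'node'
        # Static configuration: list the targets explicitly
        static_configs:
          - targets: ['10.224.192.100:9100', '10.224.192.101:9100']
      - job_name: 'node-sd'
        # Service discovery: targets are read from files and picked up when the files change
        file_sd_configs:
          - files: ['/etc/prometheus/targets/*.json']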

1.3 Comparison of monitoring systems

Monitoring system comparison chart

Here is a comparison of four monitoring systems: Zabbix and Nagios, both of which are veteran monitoring systems; Open-Falcon, an open-source monitoring system from Xiaomi; and Prometheus.

  1. From the perspective of system maturity, both Zabbix and Nagios are veteran monitoring systems with relatively stable, mature feature sets. Prometheus and Open-Falcon were both born in the last few years, and their features are still being iterated on.
  2. Scalability: Prometheus can extend its collection capabilities through various exporters and extend its storage through standard interfaces, which are introduced later.
  3. Activity: the Prometheus community is the most active, and the project is backed by the CNCF.
  4. Performance: the main difference lies in storage. Prometheus uses a high-performance time series database, Zabbix uses a relational database, and Nagios and Open-Falcon both use RRD storage.
  5. Container support: Zabbix and Nagios appeared before containers existed, so their container support is relatively poor. Open-Falcon has limited support for container monitoring. Prometheus' dynamic discovery mechanism supports monitoring of various container clusters and is currently the best solution for container monitoring.

1.4 Summary

  • Prometheus is a one-stop monitoring and alerting platform with few dependencies and complete functionality.

  • Prometheus supports monitoring of cloud and container environments, while the other systems mainly monitor hosts.

  • Prometheus' data query language is more expressive, with more powerful built-in statistical functions.

  • Prometheus is not as good as InfluxDB, OpenTSDB, or Sensu in terms of data storage scalability and durability.

  • Applicable scenarios

    Prometheus is suitable for recording time series in text format. It works well for both machine-centric monitoring and the monitoring of highly dynamic, service-oriented architectures. In the world of microservices, its support for multi-dimensional data collection and querying is a particular strength. Prometheus is designed with reliability in mind: it lets you quickly diagnose problems during an outage. Each Prometheus server is independent and does not rely on network storage or other remote services, so when parts of the infrastructure fail you can still quickly locate the point of failure through Prometheus without consuming a lot of infrastructure resources.

  • Non-applicable scenarios

    Prometheus values reliability: even during a failure you can always view the statistical information that is available about the system. If you need 100% accuracy, for example billing based on the number of requests, then Prometheus is not a good fit, because the data it collects may not be detailed and complete enough. In that case you are better off using another system to collect and analyze the data for billing, and using Prometheus to monitor the rest of the system.

2. Architecture Design

2.1 Overall architecture

Architecture diagram

[Architecture diagram: https://raw.githubusercontent.com/1458428190/prometheus-demo/main/images/prometheus.svg]

Prometheus consists of several key components, shown in yellow in the figure above: the Prometheus server, Pushgateway, Alertmanager, and the related web UI.

The Prometheus server obtains the targets to scrape from service discovery or static configuration, then periodically pulls data from those targets and stores it. A target can be one of the many exporters that expose metrics directly over an HTTP interface, or a Pushgateway dedicated to receiving pushed data.

Prometheus can also be configured with various rules, which it evaluates against the data at regular intervals. When a rule's condition is triggered, the alert is pushed to the configured Alertmanager.
Finally, Alertmanager receives the alert. According to its configuration, Alertmanager can aggregate, deduplicate, and reduce noise before sending the notification.

Prometheus also supports collecting metrics from other Prometheus instances. With multiple data centers, a separate Prometheus instance can be deployed in each data center, and a central Prometheus server can aggregate the monitoring data of all data centers through a federated setup.
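As a rough sketch of what such a federation scrape job might look like (the data-center hostnames and the match[] selector below are illustrative placeholders, not taken from the original article):

    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 15s
        honor_labels: true            # keep the job/instance labels of the federated series
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="node"}'          # only pull series belonging to the "node" job
        static_configs:
          - targets:
              - 'dc1-prometheus:9090'
              - 'dc2-prometheus:9090'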

2.2 Main modules

2.2.1 prometheus server

[Prometheus server internals: https://raw.githubusercontent.com/1458428190/prometheus-demo/main/images/prometheus-server.svg]

In terms of how it works, Prometheus first discovers targets through the scrape discovery manager, then scrapes their metrics with the scrape manager,
and finally writes the data to the corresponding local or remote storage through a storage proxy layer called fanout storage.

Local storage: sample data is stored on the local disk in a custom storage format. Local storage by itself is not highly reliable, so it is only recommended for scenarios with low data durability requirements.

Remote storage: to allow flexible scaling, Prometheus defines two standard interfaces (remote_write/remote_read) so that users can store data in any third-party storage service through these two interfaces.
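A minimal remote storage configuration sketch (the endpoint URLs are placeholders for whatever remote storage adapter is used):

    remote_write:
      - url: "http://remote-storage.example.com/api/v1/write"   # adapter endpoint implementing the remote_write protocol
    remote_read:
      - url: "http://remote-storage.example.com/api/v1/read"    # adapter endpoint implementing the remote_read protocol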

The rule manager is responsible for periodically evaluating the configured rules. When a rule's condition is triggered, the result is written back to storage, and any alerts that need to fire are pushed to Alertmanager through the notifier.
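For example, a recording rule writes the result of a periodically evaluated expression back to storage as a new series (a minimal sketch; the rule name below is only illustrative):

    groups:
      - name: node-recording
        rules:
          - record: instance:node_cpu_user:rate30s    # new series written back to storage
            expr: avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[30s]))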

2.2.2 exporter

Simply by deploying an exporter (there are many official and third-party exporters, or you can implement your own), you can collect many common kinds of metrics, such as database metrics and hardware metrics; see the official website for details.

2.2.3 alertmanager

2.2.3.1 Introduction

The alerting capability is split into two independent parts in the Prometheus architecture. Alert rules (AlertRule) are defined in the Prometheus server; Prometheus periodically evaluates them and, if a trigger condition is met, sends the alert information to Alertmanager.

Data flow chart

An alert rule consists of:

  • Alert name

    The user needs to name the alert rule, and the name should directly convey what the alert is about.

  • Alert rule

    The rule itself is defined with PromQL: the alert fires when the query result of the expression (PromQL) has held true for a given duration (for). A complete example rule is sketched after this list.

    Alertmanager, as an independent component, is responsible for receiving and processing alert information from the Prometheus server (or other client programs). Alertmanager can process these alerts further, for example by deduplicating, grouping, and routing them. Prometheus currently has built-in support for notification methods such as email and Slack; other methods, such as popo or SMS notifications, can be implemented through webhooks.

    Some background: in the Prometheus ecosystem, alerts are computed and generated inside the Prometheus server. A so-called alert rule is simply a piece of PromQL that is executed periodically; the query result is recorded in the time series ALERTS{alertname="<alert name>", <alert labels>} and used for subsequent alerting. When the Prometheus server computes these alerts, it has no ability to notify anyone about them; it can only push them to Alertmanager, which sends the notifications. This split is partly a matter of single responsibility, and partly because sending alerts is really not a "simple" task and needs a dedicated system to do well. You could say the goal of Alertmanager is not simply to "send an alert", but to "send a high-quality alert".
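Putting this together, a complete alerting rule as loaded by the Prometheus server might look like the following minimal sketch (the rule name, duration, and labels are illustrative, not from the original article):

    groups:
      - name: example-alerts
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m                              # the condition must hold for 5 minutes before the alert fires
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} is down"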

2.2.3.2 Features

In addition to basic alert notification capabilities, Alertmanager also provides alert handling features such as grouping, suppression (inhibition), and silencing:

  • Grouping: the grouping mechanism combines detailed alerts into a single notification

    In some cases, for example when a system goes down, a large number of alerts are triggered at the same time. The grouping mechanism combines these alerts into a single notification, so that you are not flooded with notifications and can locate the problem quickly.

    For example, suppose there are hundreds of service instances running in a cluster, with an alert rule set for each instance. If a network failure occurs, many of these instances may be unable to reach the database, so hundreds of alerts are sent to Alertmanager. As a user, you probably only want a single notification telling you which service instances are affected. In this case the alerts can be grouped by service cluster or alert name, so that they are bundled together into one notification.

    Grouping, notification timing, and the receiving method are configured through the Alertmanager configuration file; see the routing sketch after this feature list.

  • Suppression: once an alert has been sent, other alerts caused by it can be stopped from being sent repeatedly

    Suppression is a mechanism that, after a certain alert has fired, stops repeatedly sending other alerts caused by that alert.

    For example, when an alert fires because a cluster is unreachable, Alertmanager can be configured to ignore all other alerts related to that cluster. This avoids receiving a flood of notifications that are unrelated to the actual problem. The suppression mechanism is also set up in the Alertmanager configuration file.

  • Silence: a simple mechanism to quickly mute alerts based on labels. If a received alert matches a silence, Alertmanager will not send a notification for it

    Silences provide a simple mechanism to quickly mute alerts based on labels: if a received alert matches the silence configuration, Alertmanager will not send a notification for it. Silences are set up on the Alertmanager web UI.
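As a rough sketch of how grouping and routing are configured in Alertmanager (the receiver name, e-mail address, and timing values below are illustrative placeholders, not taken from the original article):

    route:
      receiver: 'team-email'
      group_by: ['cluster', 'alertname']   # alerts sharing these labels are combined into one notification
      group_wait: 30s                      # how long to collect more alerts before the first notification of a group
      group_interval: 5m                   # how long to wait before notifying about new alerts added to an existing group
      repeat_interval: 4h                  # how long to wait before re-sending a notification that is still firing

    receivers:
      - name: 'team-email'
        email_configs:
          - to: 'oncall@example.com'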

2.2.3.3 Architecture

[Alertmanager architecture diagram: https://raw.githubusercontent.com/1458428190/prometheus-demo/main/images/alertmanager.svg]

  1. Starting from the upper left, alert rules are configured in the Prometheus server. By default these rules are evaluated every minute; if a trigger condition is met, an alert record is generated and pushed to Alertmanager through its API.
  2. Alertmanager receives the alert record through the API, then validates and converts it, and finally stores the alert in the AlertProvider. The current implementation stores alerts in memory; other storage backends can be implemented through the interface.
  3. The Dispatcher continuously listens for new alerts and routes them to the corresponding group according to the routing tree configuration, so that alerts with the same attributes can be managed together (grouping rules are defined via group_by, based on the labels contained in the alert). This greatly reduces alert storms.
  4. Each group periodically runs the notification pipeline, at the configured interval.
  5. The notification pipeline executes, in order, the suppression logic, silencing logic, waiting for cluster data, deduplication, sending, and retry logic, and finally delivers the alert. After the alert has been sent, the notification record is synchronized to the cluster, where it is later used for cross-instance silencing and deduplication.
  • Alert suppression

    Suppression prevents the user from receiving a large number of other alert notifications caused by a problem after the alert for that problem has already been raised. For example, when a cluster is unavailable, the user probably only wants a single alert saying that the cluster has a problem, rather than a flood of alerts about abnormal applications and middleware services inside the cluster.

    inhibit_rules:
    - source_match:          # the alert that triggers the suppression
        alertname: NodeDown
        severity: critical
      target_match:          # alerts that will be suppressed
        severity: warning
      equal:                 # labels that must have equal values in the source and target alerts
      - node
    

    For example, when a host node in the cluster goes down abnormally, the alert NodeDown fires, and the alert rule defines the level severity=critical. Because the host is down, all services and middleware deployed on it become unavailable and trigger alerts of their own. According to the suppression rule above, if a new alert has severity=warning and the value of its node label is the same as that of the NodeDown alert, the new alert is considered to be caused by NodeDown, and the suppression mechanism stops its notification from being sent to the receiver.

  • Alert silence

    The user temporarily blocks specific alert notifications through the web UI or the API. Label matching rules (exact strings or regular expressions) are defined; if a new alert matches the silence settings, no notification is sent to the receiver.

    Silent

2.2.4 pushgateway

Pushgateway exists to let short-lived and batch jobs expose their metrics to Prometheus. Since the lifetime of these jobs may be too short for Prometheus to scrape them, they can instead push their metrics to the Pushgateway, which then exposes them for Prometheus to scrape.

The main reasons for using it are:

  • Prometheus uses a pull model. When a target is in a different subnet or behind a firewall, Prometheus may not be able to pull its data directly.
  • When monitoring business data, metrics from different sources need to be aggregated before being collected by Prometheus.
  • Data collection for temporary tasks

Disadvantages:

  • Data from multiple nodes is aggregated in the Pushgateway; if the Pushgateway fails, the impact is larger than losing a single target.
  • The up metric that Prometheus records for this scrape reflects only the Pushgateway itself, not the state of each node that pushed to it.
  • The Pushgateway retains all monitoring data pushed to it and never expires it on its own.

Therefore, even if the monitored job is no longer running, Prometheus will still pull the old metrics, and you need to manually clean up data that the Pushgateway no longer needs. A scrape configuration for the Pushgateway is sketched below.
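A minimal sketch of scraping a Pushgateway (the hostname is a placeholder); honor_labels keeps the job/instance labels pushed by the clients instead of overwriting them with the Pushgateway's own:

    scrape_configs:
      - job_name: 'pushgateway'
        honor_labels: true                              # preserve the labels pushed by the jobs themselves
        static_configs:
          - targets: ['pushgateway.example.com:9091']   # default Pushgateway port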

3. How to integrate

See another article for details: https://juejin.cn/post/6943406808513904654

Original article: blog.csdn.net/qq_31281327/article/details/115314824