Prometheus: (2) Monitoring overview

Table of contents

One: Introduction to Monitoring System

Operation and maintenance monitoring platform design ideas 

Two: Prometheus basic resource monitoring

2.1 Network Monitoring

2.2 Storage Monitoring

2.3 Server Monitoring

2.4 Middleware monitoring

2.5 Application Monitoring (APM)

Three: Introduction to common monitoring systems

3.1 Cacti

3.2 Nagios

3.3 Zabbix

3.4 Prometheus

3.5 Open-falcon

Four: Comparison of Prometheus and other monitoring tools

4.1 Prometheus vs Zabbix

4.2 Prometheus vs Graphite 

4.3 Prometheus vs InfluxDB

4.4 Prometheus vs OpenTSDB

4.5 Prometheus vs Nagios

4.6 Prometheus vs Sensu

Five: What can Prometheus monitor?

Six: Prometheus monitoring of Kubernetes

Seven: Prometheus alarm processing

7.1 Introduction to Prometheus Alarms

7.2 Alertmanager features

7.2.1 Grouping

7.2.2 Inhibition

7.2.3 Silence

Eight: Summary

One: Introduction to Monitoring System

The monitoring system discussed here is data center monitoring: monitoring of, and alerting on, the hardware and software in the data center. Enterprise IT architecture is gradually migrating from traditional physical servers to IaaS clouds built mainly on virtual machines, but however the infrastructure changes, it still depends on a monitoring system.

That is not all. Increasingly complex data center environments place ever higher demands on the monitoring system. More kinds of objects must be monitored, such as containers, distributed storage, SDN networks, distributed systems, and a wide variety of applications. A large volume of monitoring data must be collected and stored, for example several terabytes collected and aggregated every day. Intelligent analysis, alerting, and early warning must then be built on that data.

Every enterprise data center uses open source or commercial monitoring systems to some degree. By monitored object, monitoring can be divided into network monitoring, storage monitoring, server monitoring, and application monitoring. Because every aspect of the data center has to be covered, the monitoring system must be comprehensive and act as the data center's "eye in the sky".

Operation and maintenance monitoring platform design ideas 

  1. Data collection module
  2. Data Extraction Module
  3. Monitoring and alarm module

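These modules can be subdivided into six layers:

Layer 6: User presentation and management layer    unified user management, centralized monitoring, centralized maintenance
Layer 5: Alert event generation layer    real-time recording of alert events, analysis charts (trend analysis, visualization)
Layer 4: Alert rule configuration layer    alert rule settings, alert threshold settings
Layer 3: Data extraction layer    periodic collection of data into the monitoring module
Layer 2: Data display layer    data rendered as line charts (dynamic display of time series data)
Layer 1: Data collection layer    monitoring data from multiple channels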

Two: Prometheus basic resource monitoring

2.1 Network Monitoring

Network performance monitoring: mainly covers real-time network traffic monitoring (latency, traffic volume, success rate), together with statistics, aggregation, and analysis of historical data.

Network attack detection: mainly targets attacks on the internal or external network, such as DDoS attacks. Attack behavior is identified by analyzing anomalous traffic.

Device monitoring: mainly covers the network devices in the data center, including hardware such as routers, firewalls, and switches. Data can be collected through protocols such as SNMP.

2.2 Storage Monitoring

Storage performance monitoring: block storage is usually monitored for read/write rate, IOPS, read/write latency, disk usage, and so on; file storage is usually monitored for file system inodes, read/write speed, directory permissions, and so on.

Storage system monitoring: different storage systems expose different indicators. For Ceph, for example, you need to monitor the running status of OSDs and MONs, the number of PGs in each state, cluster IOPS, and similar information.

Storage device monitoring: for storage built on x86 servers, collectors on each storage node gather information about disks, SSDs, network cards, and other hardware. Commercial storage devices are delivered by vendors as black boxes and usually ship with their own monitoring, which reports running status, performance, and capacity.

2.3 Server Monitoring

CPU: overall CPU usage, user-mode percentage, kernel-mode percentage, per-CPU usage, run queue length, I/O wait percentage, the processes consuming the most CPU, context switch count, cache hit rate, and so on.

Memory: memory usage, free memory, the processes using the most memory, swap size, page faults, and so on.

Network I/O: upstream and downstream traffic per network card, network latency, packet loss rate, and so on.

Disk I/O: disk read/write rate, IOPS, disk usage, read/write latency, and so on.
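
Percentages such as user mode, kernel mode, and I/O wait are normally derived from cumulative counters rather than read directly. The following is a minimal Python sketch of that arithmetic, assuming two snapshots of per-mode CPU time in seconds. The function names, dictionary keys, and sample values are made up for illustration and do not come from any particular collector.

```python
def cpu_mode_percentages(before, after):
    """Return the share of CPU time spent in each mode between two snapshots.

    Both arguments map a mode name (for example "user", "system", "iowait",
    "idle") to cumulative CPU seconds. The keys are illustrative assumptions.
    """
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {mode: 100.0 * delta / total for mode, delta in deltas.items()}


def cpu_usage_percent(before, after):
    """Overall usage, treating everything except idle time as busy."""
    shares = cpu_mode_percentages(before, after)
    return 100.0 - shares.get("idle", 0.0)


# Hypothetical snapshots taken a few seconds apart.
t0 = {"user": 100.0, "system": 50.0, "iowait": 10.0, "idle": 840.0}
t1 = {"user": 130.0, "system": 60.0, "iowait": 20.0, "idle": 890.0}

shares = cpu_mode_percentages(t0, t1)
usage = cpu_usage_percent(t0, t1)
print(shares, usage)
```

Counting I/O wait as busy time is a choice made for this sketch; some tools report it separately or treat it as idle.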

2.4 Middleware monitoring

Message middleware: RabbitMQ, Kafka

Web service middleware: Tomcat, Jetty

Cache middleware: Redis, Memcached

Database middleware: MySQL, PostgreSQL

2.5 Application Monitoring (APM)

APM focuses on application monitoring, including running status, performance monitoring, log monitoring, and call chain tracing. Call chain tracing follows an entire request, from the user who sends it (usually a browser or application client), through the back-end API service, to the calls between that service and its associated middleware or other components, in order to build a complete picture of the call chain. Beyond that, APM can also monitor the call hierarchy of a component's internal methods (Controller --> Service --> DAO) and record the execution time of each function, which provides data to support performance tuning.

Besides Pinpoint, application monitoring tools include Zipkin (open-sourced by Twitter), Apache SkyWalking, CAT (open-sourced by Meituan), and others.


Three: Introduction to common monitoring systems

3.1 Cacti

Cacti (the name means "cactus") is a graphical analysis tool for network traffic monitoring built on PHP, MySQL, SNMP, and RRDtool. It collects data with snmpget and draws graphs with RRDtool, without requiring users to understand RRDtool's complicated parameters. It provides powerful data and user management: each user can be restricted to specific tree views, host devices, and graphs, and user authentication can be integrated with LDAP. Templates can also be customized. Its display of historical monitoring data is quite good.
  Templates make it possible to reuse monitoring setups across different devices, and Cacti also offers customizable graphing and strong calculation capabilities (such as data stacking).

3.2 Nagios

Nagios is a free, open source network monitoring tool that can effectively monitor the status of Windows, Linux, and Unix hosts, network devices such as switches and routers, printers, and so on. When a system or service becomes abnormal, it immediately sends an email or SMS alert to the operations staff, and it sends a recovery notification by email or SMS once the status returns to normal.
  Nagios is primarily an alerting tool. Its greatest strength is the alarm function, which supports multiple notification methods. Its weaknesses are the lack of a powerful data collection mechanism and very basic graphs. As the number of monitored hosts grows, adding hosts becomes cumbersome: configuration lives in text files with no web-based management, which is error-prone and hard to maintain.

3.3 Zabbix

Zabbix is an enterprise-grade open source solution that provides distributed system monitoring and network monitoring through a web interface. Zabbix can monitor a wide range of network parameters to keep server systems running safely, and it offers a powerful notification mechanism that lets operations staff quickly locate and resolve problems.
  Zabbix consists of two parts: the Zabbix server and the optional Zabbix agent. The Zabbix server monitors remote servers and network status and collects data through SNMP, the Zabbix agent, ping, port checks, and other methods. It runs on Linux, Solaris, HP-UX, AIX, FreeBSD, OpenBSD, OS X, and other platforms.
  Zabbix addresses Cacti's lack of alarms and Nagios's lack of web-based configuration, and it also supports distributed deployment, which made it popular quickly. It has become the most popular operations monitoring platform among small and medium-sized enterprises. Zabbix has shortcomings too. It consumes a lot of resources, and with many monitored hosts (more than about 500 servers), monitoring timeouts, alert timeouts, and single points of failure in the alerting system can occur. There are several ways to mitigate this, such as upgrading hardware, changing the Zabbix monitoring mode, or running multiple Zabbix installations.

Monitoring methods:

  • Agent: a dedicated monitoring agent with its own protocol. A host with zabbix-agent installed can be monitored by zabbix-server, and data is sent to the server for processing in active or passive mode.
  • ssh/telnet: Linux hosts support the ssh and telnet protocols.
  • SNMP: network devices such as routers and switches cannot run third-party programs (agents), so they are monitored with the Simple Network Management Protocol. Most router devices support SNMP.
  • IPMI: monitoring through the standard IPMI hardware interface, which exposes physical characteristics of the monitored object such as voltage, temperature, fan status, and power supply. It is widely used in server monitoring, for example to collect CPU temperature, fan speed, and motherboard temperature, or to power machines on and off remotely. IPMI is independent of the main hardware and the operating system: a failure of the CPU, BIOS, or OS does not affect it, because the BMC (baseboard management controller) that implements IPMI is a separate board with its own power supply.

Introduction to Zabbix core components:

  • Zabbix server: the core program of Zabbix. It interacts with Zabbix proxies and agents, evaluates triggers, sends alert notifications, and stores data centrally. Prometheus can likewise store the data it collects, but its alerting requires the separate Alertmanager component.
  • Database storage: stores configuration information and collected data.
  • Web interface: the Zabbix GUI, usually running on the same machine as the server.
  • Proxy: an optional component, often used in distributed monitoring environments, that collects data on behalf of the Zabbix server and shares its load.
  • Agent: deployed on the monitored host, where it collects data and sends it to the server.

3.4 Prometheus

Kubernetes grew out of Google's Borg cluster manager. In the same way, Prometheus (written in Go) is the open source counterpart of Borgmon, Borg's monitoring system, which makes Prometheus especially well suited to Kubernetes architectures. As a monitoring solution it is backed by a large community, with 6300 contributors from more than 700 companies, 13500 code commits, and 7200 pull requests.

Prometheus has the following features:

  • Multidimensional data model (time series identified by a metric name and key-value pairs)
  • PromQL, a flexible query and aggregation language
  • Local storage with no dependency on distributed storage; remote storage systems can be integrated for long-term data
  • Time series data is collected with a pull model over HTTP or HTTPS. A time series is the sequence of values an indicator takes at successive points in time, generated continuously. Plotted with time on the horizontal axis and the value on the vertical axis, the points form a line chart that shows how the value changes over a period.
  • Push mode can be implemented with Pushgateway (an optional intermediary component for Prometheus)
  • Target machines are found through dynamic service discovery or static configuration (for example, Consul can add and remove targets automatically)
  • Support for a variety of charts and data dashboards
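
To make the multidimensional data model concrete, here is a minimal Python sketch that renders samples in the line-oriented text format Prometheus scrapes over HTTP, where each line has the shape name{label="value",...} value. The helper names and the example metric are assumptions chosen for illustration; a real application would normally use an official client library instead of formatting lines by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as `name{k="v",...} value`."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


# Hypothetical metric: one name, two label sets, hence two distinct time series.
samples = [
    ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
    ("http_requests_total", {"method": "POST", "code": "500"}, 3),
]
body = "\n".join(render_sample(*sample) for sample in samples) + "\n"
print(body)
```

Each distinct combination of metric name and label values is its own time series, which is what lets PromQL filter and aggregate by any label.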

3.5 Open-falcon

Open-Falcon is an open source, enterprise-grade monitoring tool developed by Xiaomi. It is written in Go and is used by Internet companies including Xiaomi, Didi, and Meituan. It is a flexible, scalable, high-performance monitoring solution.

Further reading: "Operation and Maintenance Monitoring: Open-Falcon Introduction" by Yin Zhengjie (Blog Park).

PS:
  Nightingale is an enterprise-grade monitoring solution developed and open-sourced by Didi Basic Platform and Didi Cloud, designed for enterprise monitoring needs in the cloud-native era.
  Nightingale meets enterprise requirements for product completeness, high availability, and user experience, and it serves users at very different scales, from a few machines to hundreds of thousands. It covers both cloud-native and bare-metal environments and supports application monitoring as well as system monitoring. Its plug-in mechanism is flexible and the available plug-ins are extensive, which makes it highly extensible.
  Nightingale is a distributed, high-performance operations monitoring system. Starting from Open-Falcon, it substantially optimizes each core module and incorporates Didi's production experience and internal best practices, with many improvements to maintainability and ease of use. As the group's unified monitoring solution, it handles billions of monitoring indicators inside Didi, covers system, container, and application monitoring, and has thousands of active users. After five years of refinement, it takes from open source and gives back to open source. Nightingale is a fork of Open-Falcon and can be regarded as its next generation.
https://cloud.tencent.com/developer/article/1638839?from=15425


Four: Comparison of Prometheus and other monitoring tools

4.1 Prometheus vs Zabbix

  • Zabbix is written in C and PHP and Prometheus in Go; overall, Prometheus runs somewhat faster.
  • Zabbix belongs to traditional host monitoring and is mainly used for physical hosts, switches, networks, and so on. Prometheus is suitable not only for host monitoring but also for cloud, SaaS, OpenStack, and container monitoring.
  • Zabbix has richer plug-ins for traditional host monitoring.
  • Zabbix lets you configure many things in its web GUI, whereas Prometheus requires editing configuration files by hand.

4.2 Prometheus vs Graphite 

  • Graphite has a narrower scope. It focuses on two things, storing time series data and visualizing it; other functions require additional plug-ins. Prometheus is a one-stop solution that provides common features such as alerting and trend analysis, along with stronger data storage and query capabilities.
  • Graphite does better at horizontal scaling and at long data retention periods.

4.3 Prometheus vs InfluxDB

  • InfluxDB is an open source time series database used mainly to store data. To build a monitoring and alerting system on it, you need to rely on other systems.
  • InfluxDB does better at horizontal storage scaling and high availability; its core, after all, is a database.

4.4 Prometheus vs OpenTSDB

  • OpenTSDB is a distributed time series database that relies on Hadoop and HBase and can store data over longer periods. If your environment already runs Hadoop and HBase, it is a good choice.
  • To build a monitoring and alerting system, OpenTSDB needs to rely on other systems.

4.5 Prometheus vs Nagios

  • Nagios data does not support custom labels or queries, its alarms do not support denoising or grouping, and it has no data storage. Querying historical status requires installing a plug-in.
  • Nagios is a monitoring system from the 1990s and is better suited to small clusters or static systems. It shows its age and lacks many features; by comparison, Prometheus is much better.

4.6 Prometheus vs Sensu

  • In a broad sense, Sensu is an upgraded version of Nagios. It solves many problems of Nagios. If you are familiar with Nagios, using Sensu is a good choice.
  • Sensu relies on RabbitMQ and Redis, which has better scalability in data storage.

Five: What can Prometheus monitor?

# Databases
    Aerospike exporter
    ClickHouse exporter
    Consul exporter (official)
    Couchbase exporter
    CouchDB exporter
    ElasticSearch exporter
    EventStore exporter
    Memcached exporter (official)
    MongoDB exporter
    MSSQL server exporter
    MySQL server exporter (official)
    OpenTSDB Exporter
    Oracle DB Exporter
    PgBouncer exporter
    PostgreSQL exporter
    ProxySQL exporter
    RavenDB exporter
    Redis exporter
    RethinkDB exporter
    SQL exporter
    Tarantool metric library
    Twemproxy
# Hardware related
    apcupsd exporter
    Collins exporter
    IBM Z HMC exporter
    IoT Edison exporter
    IPMI exporter
    knxd exporter
    Netgear Cable Modem Exporter
    Node/system metrics exporter (official)
    NVIDIA GPU exporter
    ProSAFE exporter
    Ubiquiti UniFi exporter
# Messaging systems
    Beanstalkd exporter
    Gearman exporter
    Kafka exporter
    NATS exporter
    NSQ exporter
    Mirth Connect exporter
    MQTT blackbox exporter
    RabbitMQ exporter
    RabbitMQ Management Plugin exporter
# Storage
    Ceph exporter
    Ceph RADOSGW exporter
    Gluster exporter
    Hadoop HDFS FSImage exporter
    Lustre exporter
    ScaleIO exporter
# HTTP (web services)
    Apache exporter
    HAProxy exporter (official)
    Nginx metric library
    Nginx VTS exporter
    Passenger exporter
    Squid exporter
    Tinyproxy exporter
    Varnish exporter
    WebDriver exporter
# APIs
    AWS ECS exporter
    AWS Health exporter
    AWS SQS exporter
    Cloudflare exporter
    DigitalOcean exporter
    Docker Cloud exporter
    Docker Hub exporter
    GitHub exporter
    InstaClustr exporter
    Mozilla Observatory exporter
    OpenWeatherMap exporter
    Pagespeed exporter
    Rancher exporter
    Speedtest exporter
# Logging
    Fluentd exporter
    Google's mtail log data extractor
    Grok exporter
# Other monitoring systems
    Akamai Cloudmonitor exporter
    Alibaba Cloudmonitor exporter
    AWS CloudWatch exporter (official)
    Cloud Foundry Firehose exporter
    Collectd exporter (official)
    Google Stackdriver exporter
    Graphite exporter (official)
    Heka dashboard exporter
    Heka exporter
    InfluxDB exporter (official)
    JavaMelody exporter
    JMX exporter (official)
    Munin exporter
    Nagios / Naemon exporter
    New Relic exporter
    NRPE exporter
    Osquery exporter
    OTC CloudEye exporter
    Pingdom exporter
    scollector exporter
    Sensu exporter
    SNMP exporter (official)
    StatsD exporter (official)
# Miscellaneous
    ACT Fibernet Exporter
    Bamboo exporter
    BIG-IP exporter
    BIND exporter
    Bitbucket exporter
    Blackbox exporter (official)
    BOSH exporter
    cAdvisor
    Cachet exporter
    ccache exporter
    Confluence exporter
    Dovecot exporter
    eBPF exporter
    Ethereum Client exporter
    Jenkins exporter
    JIRA exporter
    Kannel exporter
    Kemp LoadBalancer exporter
    Kibana Exporter
    Meteor JS web framework exporter
    Minecraft exporter module
    PHP-FPM exporter
    PowerDNS exporter
    Presto exporter
    Process exporter
    rTorrent exporter
    SABnzbd exporter
    Script exporter
    Shield exporter
    SMTP/Maildir MDA blackbox prober
    SoftEther exporter
    Transmission exporter
    Unbound exporter
    Xen exporter
# Software exposing Prometheus metrics
    App Connect Enterprise
    Ballerina
    Ceph
    Collectd
    Concourse
    CRG Roller Derby Scoreboard (direct)
    Docker Daemon
    Doorman (direct)
    Etcd (direct)
    Flink
    FreeBSD Kernel
    Grafana
    JavaMelody
    Kubernetes (direct)
    Linkerd
 

Six: Prometheus monitoring of Kubernetes

In Kubernetes, all resources can be divided into several categories:

  • Infrastructure layer (Node): the cluster nodes, which provide runtime resources for the whole cluster and its applications
  • Container infrastructure (Container): provides the runtime environment for applications
  • User application (Pod): a Pod contains a set of containers that work together and expose one function (or a set of functions) to the outside
  • Internal service load balancing (Service): within the cluster, a Service exposes an application's functionality and provides load balancing for calls between applications in the cluster
  • External access entry point (Ingress): an Ingress provides an entry point from outside the cluster so that external clients can reach services deployed in the Kubernetes cluster

To build a complete monitoring system, we therefore need to consider the following five aspects:

  • Cluster node status monitoring: obtain the basic running status of each node from its kubelet service;
  • Cluster node resource usage monitoring: deploy Node Exporter on every node as a DaemonSet to collect node resource usage;
  • Monitoring of containers running on a node: obtain the running status and resource usage of all containers on a node through the cAdvisor built into each node's kubelet;
  • Application monitoring: if an application deployed in the cluster has built-in Prometheus support, find the corresponding Pod instances and collect the metrics about their internal running status directly from them;
  • Monitoring of the Kubernetes components themselves: apiserver, scheduler, controller-manager, kubelet, and kube-proxy.

Seven: Prometheus alarm processing

7.1 Introduction to Prometheus Alarms

In the Prometheus architecture, alerting is split into two independent parts. Alert rules (AlertRule) are defined in Prometheus, which evaluates them periodically; when a rule's trigger condition is met, Prometheus sends the alert to Alertmanager.

An alert rule in Prometheus mainly consists of the following parts:

  • Alert name: the user must name the alert rule, and the name should directly express what the alert is about
  • Alert rule: defined mainly by a PromQL expression; the rule states how long the expression's query result must persist (the "for" duration) before an alert is triggered
  • In Prometheus, a set of related alerts can also be defined together as a group (alert group). All of these definitions are managed in YAML files.
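
The role of the hold duration can be illustrated with a short Python sketch. It simulates periodic rule evaluation and reports the alert state after each evaluation, using the state names Prometheus shows (inactive, pending, firing). The function name, its parameters, and the sample timeline are assumptions for illustration and simplify what the real rule engine does.

```python
def alert_states(condition_by_tick, interval_seconds, for_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_tick holds one boolean per evaluation: True when the
    rule's expression returned a result at that evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for tick, condition in enumerate(condition_by_tick):
        now = tick * interval_seconds
        if not condition:
            # The condition cleared, so the hold timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluate every 60 seconds with a 120 second hold duration.
timeline = alert_states([False, True, True, True, False, True], 60, 120)
print(timeline)
```

In this sketch only firing alerts would be forwarded to Alertmanager; a condition that clears before the hold duration elapses never produces a notification.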

Alertmanager is an independent component that receives and processes alerts from Prometheus Server (or other client programs). It can process these alerts further: for example, when it receives a large number of repeated alerts it removes the duplicates, and it groups alerts and routes them to the correct recipient. Alertmanager has built-in support for notification methods such as email and Slack, and it can integrate with webhooks for more customized scenarios. For example, Alertmanager does not currently support DingTalk, so users can connect a DingTalk robot through a webhook and receive alerts in DingTalk. Alertmanager also provides silencing and inhibition mechanisms to refine notification behavior.

7.2 Alertmanager features

In addition to basic alert notification, Alertmanager provides three main features: grouping, inhibition, and silencing.

7.2.1 Grouping

Grouping combines detailed alerts into a single notification. In some cases, for example when a system outage triggers a large number of alerts at once, grouping merges the triggered alerts into one notification, so that you are not flooded with notifications and can locate the problem quickly.

Suppose a cluster runs hundreds of service instances and alert rules are set for each instance. If a network failure occurs, many service instances may lose their database connection, and hundreds of alerts are sent to Alertmanager as a result.

As a user, you probably only want a single notification showing which service instances are affected. The alerts can be grouped by the cluster the service belongs to or by alert name, so that they are combined into one notification.
Alert grouping, alert timing, and alert receivers are configured in the Alertmanager configuration file.
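
The grouping idea can be sketched in a few lines of Python: alerts that share the same values for a chosen set of labels fall into the same bucket, and each bucket becomes one notification. The function name, label names, and alerts below are hypothetical; Alertmanager's own implementation also handles timing and routing, which this sketch leaves out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts raised after a network failure.
alerts = [
    {"alertname": "DatabaseUnreachable", "cluster": "prod-a", "instance": "app-1"},
    {"alertname": "DatabaseUnreachable", "cluster": "prod-a", "instance": "app-2"},
    {"alertname": "DatabaseUnreachable", "cluster": "prod-b", "instance": "app-7"},
]

groups = group_alerts(alerts, ["cluster", "alertname"])
for key, members in groups.items():
    instances = ", ".join(alert["instance"] for alert in members)
    print(f"{key}: {len(members)} alert(s) -> {instances}")
```

Here three alerts collapse into two notifications, one per cluster, each listing the affected instances.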

7.2.2 Inhibition

Inhibition is a mechanism that, once a given alert has fired, stops notifications for other alerts caused by it.

For example, if an alert fires because a cluster is unreachable, Alertmanager can be configured to ignore all other alerts related to that cluster. This avoids a flood of notifications that do not point to the actual problem.

Inhibition is also configured in the Alertmanager configuration file.
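
A minimal Python sketch of the inhibition decision follows. It assumes a rule made of three parts: matchers for the source alert, matchers for the target alert, and a list of labels that must be equal on both. The dictionary keys "source", "target", and "equal", the function names, and the alerts are hypothetical and only loosely mirror an Alertmanager inhibit rule.

```python
def matches(labels, matchers):
    """True when every matcher label is present with the expected value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Decide whether `target` is muted by some firing source alert."""
    if not matches(target, rule["target"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source"]):
            continue
        # Both alerts must agree on every label listed under "equal".
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


# Hypothetical rule: a cluster-wide outage mutes warnings from the same cluster.
rule = {
    "source": {"alertname": "ClusterUnreachable"},
    "target": {"severity": "warning"},
    "equal": ["cluster"],
}
outage = {"alertname": "ClusterUnreachable", "severity": "critical", "cluster": "prod-a"}
same_cluster = {"alertname": "HighLatency", "severity": "warning", "cluster": "prod-a"}
other_cluster = {"alertname": "HighLatency", "severity": "warning", "cluster": "prod-b"}

firing = [outage, same_cluster, other_cluster]
print(is_inhibited(same_cluster, firing, rule))
print(is_inhibited(other_cluster, firing, rule))
```

The list of equal labels is what keeps an outage in one cluster from muting alerts in another.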

7.2.3 Silence

Silencing provides a simple mechanism to quickly mute alerts based on their labels. If a received alert matches an active silence, Alertmanager does not send a notification for it.

Silences are set on the Alertmanager web page.
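
As an illustration, a silence can be thought of as a time window plus a set of label matchers. The Python sketch below checks whether an alert is muted at a given moment. The dictionary layout, the tuple form of the matchers, and all names are assumptions for this example rather than Alertmanager's actual data structures.

```python
import re
from datetime import datetime, timedelta, timezone


def is_silenced(labels, silence, now):
    """True when the silence is active at `now` and all its matchers match."""
    if not (silence["starts_at"] <= now < silence["ends_at"]):
        return False
    for name, pattern, is_regex in silence["matchers"]:
        value = labels.get(name, "")
        if is_regex:
            # Regex matchers are treated as anchored to the whole value.
            if re.fullmatch(pattern, value) is None:
                return False
        elif value != pattern:
            return False
    return True


# Hypothetical two-hour maintenance window for disk alerts in one cluster.
start = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
silence = {
    "starts_at": start,
    "ends_at": start + timedelta(hours=2),
    "matchers": [("cluster", "prod-a", False), ("alertname", "Disk.*", True)],
}
disk_alert = {"alertname": "DiskFull", "cluster": "prod-a"}
latency_alert = {"alertname": "HighLatency", "cluster": "prod-a"}

print(is_silenced(disk_alert, silence, start + timedelta(minutes=30)))
print(is_silenced(disk_alert, silence, start + timedelta(hours=3)))
print(is_silenced(latency_alert, silence, start + timedelta(minutes=30)))
```

Because a silence expires on its own, it suits planned maintenance better than editing alert rules.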


Eight: Summary

Prometheus is a one-stop monitoring and alerting platform with few dependencies and a complete feature set.
Prometheus supports cloud and container monitoring, whereas the other systems mainly monitor hosts.
Prometheus query statements are more expressive and come with more powerful built-in statistical functions.
Prometheus is not as strong as InfluxDB, OpenTSDB, and Sensu in the scalability and persistence of data storage.

Origin blog.csdn.net/ver_mouth__/article/details/126270966