[Monitoring] Prometheus monitoring overview

1. Introduction to monitoring systems

The monitoring system discussed here refers specifically to data center monitoring: monitoring and alerting on the hardware and software in a data center. Enterprise IT architectures are gradually migrating from traditional physical servers to IaaS clouds dominated by virtual machines, and no matter how the infrastructure changes, it cannot do without the support of a monitoring system.

Moreover, the increasingly complex data center environment places ever higher demands on the monitoring system: it must monitor diverse objects such as containers, distributed storage, SDN networks, and many kinds of distributed applications; it must collect and store large volumes of monitoring data, for example several terabytes per day; and it must provide intelligent analysis, alerting, and early warning on top of that data.

Every enterprise data center uses some open source or commercial monitoring systems to a greater or lesser extent. By monitored object, monitoring can be divided into network monitoring, storage monitoring, server monitoring, and application monitoring. Since every aspect of the data center must be watched, the monitoring system needs to be comprehensive, acting as the "eye in the sky" of the data center.


2. Basic resource monitoring

2.1. Network monitoring

Network performance monitoring: mainly covers real-time traffic monitoring (network latency, throughput, success rate) as well as the collection, aggregation, and analysis of historical data.

Network intrusion detection: mainly targets attacks on the intranet or from the extranet, such as DDoS, identifying malicious network behavior by analyzing anomalous traffic.

Device monitoring: mainly monitors the various network devices in the data center, including hardware such as routers, firewalls, and switches; data can be collected through protocols such as SNMP.

2.2. Storage monitoring

Storage performance monitoring: block storage usually monitors read/write throughput, IOPS, read/write latency, disk usage, and so on; file storage usually monitors file system inodes, read/write speed, directory permissions, and so on.

Storage system monitoring: different storage systems expose different indicators. For Ceph, for example, one needs to monitor the status of OSD and MON daemons, the number of PGs in each state, cluster IOPS, and similar information.

Storage device monitoring: for storage built on x86 servers, device information such as disks, SSDs, and NICs is collected through agents on each storage node. Commercial storage appliances are black boxes provided by the vendor and usually ship with their own monitoring of device status, performance, and capacity.

2.3. Server monitoring

CPU: involves the usage of the entire CPU, the percentage of user mode, the percentage of kernel mode, the usage of each CPU, the length of the waiting queue, the percentage of I/O waiting, the process with the most CPU consumption, the number of context switches, the cache hit rate, etc.

Memory: involves memory usage, remaining amount, process with the highest memory usage, swap partition size, page fault exception, etc.

Network I/O: involves the upstream traffic, downstream traffic, network delay, packet loss rate, etc. of each network card.

Disk I/O: involves the read/write rate of the hard disk, IOPS, disk usage, read/write delay, etc.

2.4. Middleware monitoring

Message middleware: RabbitMQ, Kafka

Web service middleware: Tomcat, Jetty

Cache middleware: Redis, Memcached

Database middleware: MySQL, PostgreSQL

2.5. Application Monitoring (APM)

APM mainly targets application monitoring, including application running status, performance, logs, and call chain tracing. Call chain tracing follows a request end to end: from the user sending the request (usually from a browser or application client), to the back-end API service, and on through the middleware and other components that service calls, building a complete call chain. Beyond that, APM can also trace the call hierarchy inside a component (Controller –> Service –> DAO) to obtain the execution time of each function, providing data to support performance tuning.

In addition to Pinpoint, the application monitoring tools also include Twitter's open source Zipkin, Apache SkyWalking, Meituan's open source CAT, etc.

[Figure: call chain monitoring]

[Figure: comparison of several APM products]

[Figure: Pinpoint UI]

In addition to intercepting method calls, APM can also intercept TCP and HTTP network requests, so as to obtain information about the methods and SQL statements that take the longest to execute, and the API with the greatest delay.

3. Introduction to Prometheus

3.1. What is Prometheus

Prometheus is an open source system monitoring and alerting framework. Inspired by Google's Borgmon monitoring system, it was created in 2012 at SoundCloud by former Google engineers, developed as a community open source project, and officially released in 2015. In 2016, Prometheus joined the Cloud Native Computing Foundation as its second hosted project, after Kubernetes.

3.2. Advantages

Powerful multidimensional data model:

Time series data are identified by metric name and key-value label pairs.
Any metric can be given arbitrary multidimensional labels.
The data model is flexible; metric names do not need to be forced into dot-separated strings.
Aggregation, slicing, and dicing operations can be performed on the data model.
Sample values are double-precision floats, and labels can contain any Unicode characters.
Flexible and powerful query language (PromQL): a single query can multiply, add, join, and take quantiles over multiple metrics.

Easy to manage: the Prometheus server is a single binary that works directly on a local node, without depending on distributed storage.

Efficient: each sample occupies only about 3.5 bytes on average, and a single Prometheus server can handle millions of metrics.

The pull model used to collect time series data is not only convenient for local testing but also prevents misbehaving servers from pushing bad metrics.

Time series data can also be pushed to Prometheus via the Push Gateway.

Monitoring targets can be obtained through service discovery or static configuration.

A variety of visualization front ends are available.

Easy to scale.

3.3. Components

The Prometheus ecosystem consists of several components, many of which are optional:

Prometheus Server: collects and stores time series data.
Client Library: generates the corresponding metrics for the service being monitored and exposes them to the Prometheus server; when the server pulls, the current metric values are returned.
Push Gateway: mainly used for short-lived jobs. Because such jobs may disappear before Prometheus comes to pull, they push their metrics to the Push Gateway, which Prometheus then scrapes. This approach is mainly for service-level metrics; for machine-level metrics, use node_exporter.
Exporters: expose the metrics of existing third-party services to Prometheus.
Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the configured receivers, and sends notifications. Common receivers include email, PagerDuty, OpsGenie, and webhooks.
Various other tools.
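To make the Push Gateway flow concrete, here is a minimal, stdlib-only Python sketch of the HTTP PUT a short-lived job could send to a Pushgateway. The gateway address and job name are placeholders; the request is built but not actually sent, and the metric body uses the text exposition format described later in this article.

```python
import urllib.request

def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build a PUT request that replaces all metrics for `job` on the Pushgateway."""
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )

# A short-lived batch job pushes its result before exiting.
# (The gateway address is hypothetical; the push only works against a real Pushgateway.)
body = "# TYPE my_batch_duration_seconds gauge\nmy_batch_duration_seconds 42.5\n"
req = build_push_request("pushgateway.example:9091", "nightly_backup", body)
# urllib.request.urlopen(req)  # uncomment to actually push
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed series survive until overwritten or deleted.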

3.4. Architecture

[Figure: Prometheus architecture diagram]

From this architecture diagram, we can also see that the main modules of Prometheus include Server, Exporters, Pushgateway, PromQL, Alertmanager, WebUI, etc.

Its workflow is roughly as follows:

  1. The Prometheus server periodically pulls data from statically configured targets or targets discovered via service discovery.
  2. When newly scraped data exceeds the configured in-memory buffer, Prometheus persists it to disk (or, if remote storage is configured, to the remote endpoint).
  3. Prometheus can be configured with rules that it evaluates periodically against the data; when a condition is triggered, an alert is pushed to the configured Alertmanager.
  4. When Alertmanager receives alerts, it can aggregate, deduplicate, and silence them, and finally send notifications according to its configuration.
  5. Data can be queried and aggregated via the API, the Prometheus console, or Grafana.
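The pull–evaluate–alert cycle above can be sketched with in-memory stand-ins. All names, targets, and thresholds here are illustrative, not real Prometheus APIs; the point is only the shape of steps 1, 3, and 4.

```python
import time

def scrape(target):
    """Step 1: pull current samples from a target (targets are plain dicts here)."""
    return {"up": 1.0, "http_requests_total": target["counter"]}

def evaluate_rules(samples, rules):
    """Step 3: evaluate alerting rules against the freshly scraped samples."""
    return [name for name, predicate in rules.items() if predicate(samples)]

def notify(alerts):
    """Step 4: hand firing alerts to an 'Alertmanager' (here: just collect them)."""
    return [{"alert": name, "ts": time.time()} for name in alerts]

targets = [{"counter": 1500.0}]
rules = {"TooManyRequests": lambda s: s["http_requests_total"] > 1000}

for t in targets:
    samples = scrape(t)                       # step 1: pull
    firing = evaluate_rules(samples, rules)   # step 3: rule evaluation
    notifications = notify(firing)            # step 4: alert delivery
```

A real server would run this loop on the scrape interval and persist the samples between iterations (step 2).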

3.5. What scenarios are applicable

Prometheus records time series in a text-based format and suits both machine-centric monitoring and the monitoring of highly dynamic service-oriented architectures. In a microservices world, its support for multidimensional data collection and querying is a particular strength. Prometheus is designed with reliability in mind: it lets you diagnose problems quickly during an outage. Each Prometheus server is independent, with no dependency on network storage or other remote services, so when infrastructure fails you can still locate the fault through Prometheus without consuming large amounts of infrastructure resources.

3.6. What scenarios are not suitable

Prometheus takes reliability very seriously, and even in the event of a failure, you can always see available statistics about your system. If you need 100% accuracy, such as billing by the number of requests, then Prometheus is not suitable for you, because the data it collects may not be detailed and complete. In this case, you're better off using another system to collect and analyze data for billing purposes, and Prometheus to monitor the rest of the system.

4. Data Model

4.1. Data model

All monitoring data collected by Prometheus is stored as metrics in its built-in time series database (TSDB): streams of timestamped values belonging to the same metric name and the same set of labels. Besides stored time series, Prometheus can also generate temporary, derived time series as the result of queries.

Metric name and labels
Each time series is uniquely identified by a metric name and a set of labels (key-value pairs). The metric name should reflect the meaning of the monitored sample (for example, http_requests_total denotes the total number of HTTP requests received by the system). Metric names may contain only ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

[info] note

Colons are reserved for user-defined recording rules; they must not be used in metric names defined by exporters or exposed directly by monitored targets.

Labels give Prometheus its powerful multidimensional data model: for the same metric name, each distinct combination of labels identifies a specific dimensional instance of that metric (for example, all HTTP requests to /api/tracks carrying the label method="POST" form one specific series). The query language filters and aggregates over these metrics and labels. Changing any label value (including adding or removing a label) creates a new time series.

Label names may contain only ASCII letters, digits, and underscores and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names prefixed with __ are reserved for internal use. Label values may contain any Unicode characters.
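These naming rules are mechanical, so they are easy to check in code. A small sketch using the two regular expressions quoted above:

```python
import re

# The regular expressions quoted above, from the Prometheus data model docs.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def valid_metric_name(name: str) -> bool:
    return METRIC_NAME_RE.match(name) is not None

def valid_label_name(name: str) -> bool:
    # Names starting with __ are reserved for internal use.
    return LABEL_NAME_RE.match(name) is not None and not name.startswith("__")
```

For example, `job:request_rate5m` is a valid metric name (the colon marks a recording rule), while `2xx_total` is not, because a name may not start with a digit.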

Time series samples
Each point in a time series is called a sample. A sample consists of three parts:

  • Metric: the metric name and the labelset describing the sample;
  • Timestamp: a timestamp with millisecond precision;
  • Value: a float64 value representing the sample's measurement.

Notation
A time series with a given metric name and label set is written as:

<metric name>{<label name>=<label value>, ...}

For example, a time series with the metric name api_http_requests_total and the label method="POST" and handler="/messages" can be expressed as:

api_http_requests_total{method="POST", handler="/messages"}

This is the same notation used in OpenTSDB.

4.2. Indicator types
The Prometheus client libraries provide four core metric types. However, these types exist only in the client libraries (so clients can call type-appropriate APIs) and in the wire protocol; the Prometheus server itself does not distinguish metric types and simply treats them all as untyped time series. This may change in the future.

Counter

  • A cumulative metric; typical applications include the number of requests, the number of completed tasks, and the number of errors.

For example, http_requests_total in the Prometheus server indicates the total number of HTTP requests Prometheus has processed. With functions such as increase() we can easily obtain the increment over any interval; this is detailed in the PromQL section.
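A simplified sketch of how an increase is derived from raw counter samples, including the counter-reset handling that PromQL's rate()/increase() performs (but without their boundary extrapolation):

```python
def counter_increase(samples):
    """Approximate the increase of a counter over a window of (timestamp, value)
    samples, compensating for counter resets (the value dropping back toward zero).
    Simplified: no boundary extrapolation, unlike PromQL's increase()."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from ~0, so the whole new value counts.
        total += cur - prev if cur >= prev else cur
    return total

# The counter was reset between t=15 and t=30 (160 -> 40),
# so the true increase is 60 + 40 = 100.
inc = counter_increase([(0, 100.0), (15, 160.0), (30, 40.0)])
```

This is why counters should only ever go up: the reset heuristic assumes any decrease is a process restart, not a real decrement.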


Gauge

  • A conventional metric, typical applications such as: temperature, the number of running goroutines.
  • Can be added or subtracted arbitrarily.

For example, go_goroutines in Prometheus server indicates the number of current goroutines in Prometheus.


Histogram

  • Can be understood as a histogram, typical applications such as: request duration, response size.
  • Observations can be sampled, grouped and counted.

For example, when querying prometheus_http_request_duration_seconds_sum{handler="/api/v1/query", instance="localhost:9090", job="prometheus"}, the returned results are as follows:

[Figure: query result for prometheus_http_request_duration_seconds_sum]
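Histogram buckets are cumulative, which is what lets PromQL's histogram_quantile() estimate quantiles by linear interpolation inside the bucket that contains the target rank. A simplified sketch of that estimation (the bucket numbers are illustrative):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets, in the spirit
    of PromQL's histogram_quantile(). `buckets` is a sorted list of (le, count)
    ending with the +Inf bucket. Simplified: the first bucket's lower bound is 0."""
    total = buckets[-1][1]          # the +Inf bucket holds the overall count
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):  # quantile falls in the open-ended bucket
                return prev_le
            # Linear interpolation within the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

buckets = [(0.05, 24054), (0.1, 33444), (0.2, 100392),
           (0.5, 129389), (1.0, 133988), (float("inf"), 144320)]
p90 = histogram_quantile(0.9, buckets)  # roughly 0.554 seconds
```

Because the result is interpolated, its accuracy depends entirely on how the bucket boundaries were chosen.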

Summary

  • Similar to Histogram, typical applications such as: request duration, response size.
  • Provides count and sum functions for observations.
  • Provides quantiles, i.e. observed values can be tracked by percentile.

4.3. Instances and jobs

In Prometheus, any independent data source (target) is called an instance. A collection of instances of the same type is called a job. Here is a job with four replicated instances:

- job: api-server
    - instance 1: 1.2.3.4:5670
    - instance 2: 1.2.3.4:5671
    - instance 3: 5.6.7.8:5670
    - instance 4: 5.6.7.8:5671

Automatically generated labels and time series

When Prometheus scrapes data, it automatically attaches labels to the time series to identify the data source (target):

  • job: The configured job name that the target belongs to.
  • instance: the <host>:<port> part of the target's URL that was scraped.

If any of these labels already exists in the scraped data, the behavior depends on the honor_labels configuration option. See the official scrape configuration documentation for details.

For each instance, Prometheus stores the following sample series about the scrape itself:

  • up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, 0 if the scrape failed.

  • scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}: how long the scrape took.

  • scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}: the number of samples remaining after metric relabeling.

  • scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}: the number of samples the instance exposed.

Among these, the up series is particularly useful for monitoring whether an instance is healthy.

5. Other monitoring tools

In the preface, we briefly introduced the reasons why we chose Prometheus and the benefits it brought us after using it.

Here it is mainly compared with other monitoring solutions, so that everyone can better understand Prometheus.

Prometheus vs Zabbix

  • Zabbix is written in C and PHP, Prometheus in Go; overall, Prometheus runs somewhat faster.
  • Zabbix belongs to traditional host monitoring and is mainly used for physical hosts, switches, and networks. Prometheus suits not only host monitoring but also cloud, SaaS, OpenStack, and container monitoring.
  • Zabbix has richer plugins for traditional host monitoring.
  • Zabbix can be configured largely through its web GUI, while Prometheus is configured by editing files.

Prometheus vs Graphite

  • Graphite has a narrower scope: it focuses on storing and visualizing time series data, with other functions provided by plug-ins. Prometheus is a one-stop shop that also provides alerting and trend analysis, with stronger data storage and query capabilities.
  • Graphite does better on horizontal scaling and long data retention.

Prometheus vs InfluxDB

  • InfluxDB is an open source time series database mainly used to store data; to build a monitoring and alerting system, you need to rely on other components.
  • InfluxDB does better on storage scale-out and high availability; after all, its core is the database.

Prometheus vs OpenTSDB

  • OpenTSDB is a distributed time series database that relies on Hadoop and HBase and can retain data for longer periods. If your system already runs Hadoop and HBase, it is a good choice.
  • To build a monitoring and alerting system, OpenTSDB needs to rely on other components.

Prometheus vs Nagios

  • Nagios data does not support custom labels or queries; its alerts support neither denoising nor grouping; and it does not store historical data unless you install a plug-in.
  • Nagios is a 1990s-era monitoring system better suited to small clusters or static systems. It is showing its age, and Prometheus is far more capable by comparison.

Prometheus vs Sensu

  • Broadly speaking, Sensu is an upgraded Nagios that solves many of Nagios's problems; if you are familiar with Nagios, Sensu is a good choice.
  • Sensu relies on RabbitMQ and Redis and scales better in data storage.

Summary

  • Prometheus is a one-stop monitoring and alerting platform with few dependencies and complete functionality.
  • Prometheus supports monitoring clouds and containers, while most other systems mainly monitor hosts.
  • Prometheus query statements are more expressive and have more powerful built-in statistical functions.
  • Prometheus is not as good as InfluxDB, OpenTSDB, or Sensu in terms of data storage scalability and durability.

6. Exporters

6.1. Text format

Before discussing exporters, it is worth introducing the Prometheus text exposition format, because an exporter essentially converts the data it collects into this text format and serves it over HTTP.

The text produced by an exporter is line-oriented (lines separated by \n); blank lines are ignored, and the output must end with a newline.

Comment
If a line starts with #, it is usually a comment.

  • Beginning with # HELP means metric help instructions.
  • Starting with # TYPE means defining the metric type, including counter, gauge, histogram, summary, and untyped types.
  • Others represent general comments, for reading purposes, and will be ignored by Prometheus.
Sample data
If a line does not start with #, it is sample data. It usually follows the type definition line and has the format:

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]

Here is a complete example:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

It is important to note that if a sampled metric is called x and x is of histogram or summary type, the following must hold:

  • The sum of the sampled values is exposed as x_sum.
  • The number of samples is exposed as x_count.
  • A quantile of a summary-type metric is exposed as x{quantile="y"}.
  • A bucket of a histogram-type metric is exposed as x_bucket{le="y"}.
  • A histogram must include the bucket x_bucket{le="+Inf"}, whose value equals x_count.
  • The quantile values of a summary and the le values of a histogram must appear in increasing order.
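A minimal parser for sample lines in this format can be sketched with regular expressions. It is deliberately simplified: comments and blank lines are skipped rather than interpreted, and escape sequences inside label values are left as-is.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'                # optional label block
    r'\s+(?P<value>\S+)'                      # sample value
    r'(?:\s+(?P<ts>-?\d+))?\s*$'              # optional timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_line(line):
    """Parse one exposition-format sample line into (name, labels, value, ts).
    Comment (#) and blank lines yield None. Simplified: no escape unquoting."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts
```

Running it over the example above, the first counter line parses to ("http_requests_total", {"method": "post", "code": "200"}, 1027.0, 1395066363000).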

6.2. Common queries

After collecting the data of node_exporter, we can use PromQL to perform some business queries and monitoring. The following are some common queries.

Note: The following queries use a single node as an example. If you want to view all nodes, just remove instance="xxx".

CPU usage

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
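The query subtracts the per-instance idle rate from 100. The same arithmetic, done by hand over two raw samples of the cumulative idle counter (the numbers are illustrative):

```python
def cpu_usage_percent(idle_t0, idle_t1, interval_seconds, num_cpus):
    """CPU usage over an interval, from two snapshots of the cumulative
    idle-mode CPU-seconds counter (what irate() approximates in the query above).
    In total, node_cpu_seconds_total grows by num_cpus seconds of CPU time per
    wall-clock second, so the idle fraction is delta_idle / (interval * num_cpus)."""
    idle_fraction = (idle_t1 - idle_t0) / (interval_seconds * num_cpus)
    return 100.0 * (1.0 - idle_fraction)

# 4-CPU machine, samples 60 s apart: the idle counter grew by 180 CPU-seconds,
# i.e. the machine was idle 75% of its capacity, so usage is 25%.
usage = cpu_usage_percent(idle_t0=1000.0, idle_t1=1180.0,
                          interval_seconds=60, num_cpus=4)
```

In the PromQL version, avg by (instance) performs the division by the number of CPUs, since there is one mode="idle" series per CPU.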

CPU mode ratio

avg by (instance, mode) (irate(node_cpu_seconds_total[5m])) * 100

machine load average

node_load1{instance="xxx"}   // 1-minute load average
node_load5{instance="xxx"}   // 5-minute load average
node_load15{instance="xxx"}  // 15-minute load average

memory usage

100 - ((node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes)/node_memory_MemTotal_bytes) * 100

disk usage

100 - node_filesystem_free{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} / node_filesystem_size{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} * 100

Or you can directly use {fstype="xxx"} to specify the disk information you want to view

Network I/O
// inbound bandwidth (kbps)

sum by (instance) (irate(node_network_receive_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)

// outbound bandwidth (kbps)

sum by (instance) (irate(node_network_transmit_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)

NIC incoming/outgoing packets
// incoming packets

sum by (instance) (rate(node_network_receive_packets{instance="xxx",device!="lo"}[5m]))

// outgoing packets

sum by (instance) (rate(node_network_transmit_packets{instance="xxx",device!="lo"}[5m]))


Origin blog.csdn.net/u011397981/article/details/128953500