A brief usage summary of the common monitoring stack Prometheus + Grafana

Foreword

Prometheus is an open source system monitoring and alerting toolkit written in Go. It was originally developed at SoundCloud in 2012 and was later adopted by many large companies and organizations. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and graduated in 2018; it is now a standalone open source project maintained independently of any company.

Prometheus is more than a monitoring tool; it is a complete monitoring solution that covers collecting, storing, processing, visualizing, and alerting on monitoring data.

Grafana is an open source, cross-platform metrics analysis and visualization tool that supports multiple data sources, such as Prometheus, Elasticsearch, and InfluxDB. It provides a rich set of charts and panels that help users understand and analyze monitoring data.

Prometheus ships with its own web UI for displaying charts, but it is fairly basic. Grafana can build much more polished charts and supports Prometheus as a data source out of the box, so the classic monitoring stack is Prometheus + Grafana.

1. Concept

1.1 Development

Operations and maintenance (O&M) monitoring dates back to the early days of computing and has evolved along with the technology it observes.

  • Early O&M monitoring was done mainly by manually checking system logs and performance metrics, which was time-consuming and error-prone.
  • Later, monitoring tools built around the SNMP protocol appeared, such as Nagios and Zabbix. These tools collect system logs and performance metrics automatically and analyze them, helping users understand the operating status of their systems.
  • In recent years, with the rise of cloud computing and container technology, monitoring has moved to cloud native tooling. Prometheus, for example, is an open source monitoring system built for cloud native environments that helps users manage and monitor cloud native applications.

In short, O&M monitoring has progressed from manual inspection, to SNMP-based monitoring tools, to today's automated monitoring systems, offering users better service and support at each step.

1.2 Time series data

Time series data is data recorded and indexed in time order. Devices in fields such as the Internet of Things, the Internet of Vehicles, and the Industrial Internet generate massive amounts of time series data, which is reported to account for more than 90% of the world's total data. On a monitoring platform, time series data usually means timestamped sequential data such as system performance metrics and log information.

Compared with traditional relational data, time series workloads concentrate on the C and R of CRUD (create and read): samples are appended and queried, but not updated.
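The create-and-read pattern can be illustrated with a minimal sketch. The `TimeSeries` class and its method names are hypothetical and exist only for this example; a real time series database adds compression, indexing, and retention on top of the same idea.

```python
from bisect import bisect_left, bisect_right


class TimeSeries:
    """Toy append-only series: samples are created and read, never updated."""

    def __init__(self):
        self._timestamps = []  # seconds, kept in ascending order
        self._values = []

    def append(self, timestamp, value):
        # Create: new samples must arrive in increasing time order.
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("samples must be appended in increasing time order")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def read(self, start, end):
        # Read: all (timestamp, value) pairs with start <= timestamp <= end.
        lo = bisect_left(self._timestamps, start)
        hi = bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


cpu = TimeSeries()
for ts, value in [(0, 12.5), (15, 13.0), (30, 40.2), (45, 38.9)]:
    cpu.append(ts, value)

recent = cpu.read(15, 30)
```

Because there is no update path, an out-of-order write such as `cpu.append(10, 1.0)` is rejected here rather than overwriting an existing sample.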

1.3 Metric

A metric (measurement, indicator) is a central concept that appears constantly in O&M monitoring. It refers to an indicator tracked by the monitoring system, such as CPU usage, memory usage, or network traffic. In Prometheus, a metric is essentially a record stored in the time series database.

The Prometheus client libraries divide metrics into four types:

  • Counter: A cumulative metric that represents a monotonically increasing counter whose value can only increase or be reset to zero on a restart. For example, you can use counters to represent the number of requests served, tasks completed, or errors.
  • Gauge: A single value that can go up and down arbitrarily. Gauges are typically used for measured values, such as temperature or current memory usage, but also for "counts" that may rise and fall, such as the number of concurrent requests.
  • Histogram: Samples observations over a period of time and counts them in buckets to show their distribution. For example, for the latency of an interface: how many requests fall in 10ms - 20ms, how many fall in 20ms - 30ms, and so on.
  • Summary: Similar to a Histogram, but calculates percentiles from the observed samples. For example, the TP95 and TP99 latency of a call chain.
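As a rough illustration of how the first three types behave, here is a toy sketch in plain Python. These classes are hypothetical and are not the API of any Prometheus client library; Summary is left out because a proper quantile estimator is beyond a short example.

```python
import bisect


class Counter:
    """Monotonically increasing value; only resets when the process restarts."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Single value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, plus a running sum and total count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for observations above the largest bound (+Inf).
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        # Buckets are reported cumulatively: each includes all smaller buckets.
        total, out = 0, []
        for bound, n in zip(self.upper_bounds + [float("inf")], self.bucket_counts):
            total += n
            out.append((bound, total))
        return out


requests_total = Counter()
in_flight = Gauge()
latency = Histogram([0.01, 0.02, 0.03])  # bucket bounds in seconds

for seconds in [0.005, 0.015, 0.015, 0.025, 0.2]:
    requests_total.inc()
    latency.observe(seconds)
in_flight.set(3)
```

The cumulative bucket layout is the reason a histogram can answer "how many requests took at most 20ms" with a single lookup.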

2. Prometheus

2.1 Architecture

  • Prometheus Server : Finds the targets to monitor through service discovery and pulls metric data from them (the pull model). Recording rules precompute metric data according to defined rules, and alerting rules trigger alerts that are sent to the Alertmanager component for handling. The server also stores the time series data.

  • Pushgateway : Target hosts can push their data to the Pushgateway, and the Prometheus server then pulls the data from the Pushgateway in one place.

  • Exporters : Collect monitoring metrics from existing third-party services and expose them. Prometheus supports a wide variety of exporters; the Prometheus server pulls the metrics that each exporter exposes.

  • Alertmanager : Receives alerts from the Prometheus server, deduplicates and groups them, and routes them to the matching receiver, which sends the notification. Common receivers include email, WeChat, DingTalk, and Slack.

  • Grafana : Data visualization component and monitoring dashboard; it queries data from the Prometheus server through PromQL and displays it

  • Prometheus web UI : simple web console, default port 9090
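To make the pull model concrete, the sketch below shows roughly what an exporter does: it renders its current values as text and serves them on /metrics for the server to scrape. The metric names, the sample values, and port 8000 are assumptions made up for this example, and the format shown is only the simplest form of the text exposition format (no label escaping).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, type, help, labels, value) tuples as exposition text."""
    lines = []
    for name, metric_type, help_text, labels, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical values; a real exporter would read them from the monitored service.
SAMPLES = [
    ("demo_requests_total", "counter", "Requests served.", {"method": "GET"}, 42),
    ("demo_temperature_celsius", "gauge", "Current temperature.", {}, 21.5),
]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks forever; the server would be scraped at http://<host>:8000/metrics.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the exporter never contacts the Prometheus server itself, which is the key difference from the Pushgateway path.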

2.2 Configuration

Prometheus loads its configuration file through the --config.file command-line option.

When --web.enable-lifecycle is enabled, sending a POST request to the URL /-/reload reloads the configuration file without restarting Prometheus.
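A minimal sketch of triggering that reload from a script, using only the standard library. The base URL http://localhost:9090 is an assumption (a local server on the default port), and the function names are made up for this example.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url="http://localhost:9090", timeout=5):
    # Returns the HTTP status code; urlopen raises on connection or HTTP errors,
    # for example when the lifecycle API has not been enabled.
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as response:
        return response.status
```

Calling `reload_config()` after editing the configuration file avoids a restart and the gap in scraping that comes with it.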

Configuration documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/ . The four most commonly used configuration sections are:

  • global

    Global settings, such as the scrape interval, the scrape timeout, and the rule evaluation interval.

    • scrape_interval how often targets are scraped by default, default 1m
    • scrape_timeout how long a scrape may run before it times out, default 10s
    • evaluation_interval how often rules are evaluated, default 1m
  • rule_files

    Lists the rule files to load. There are two kinds of rules: recording rules and alerting rules.

  • alerting

    Configures the Alertmanager instances that alerts are sent to

  • scrape_configs

    Configures the jobs that pull data from targets, document: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config

    • job_name job name
    • scrape_interval scrape frequency, defaults to global.scrape_interval
    • scrape_timeout scrape timeout, defaults to global.scrape_timeout
    • metrics_path path the metrics are scraped from, default /metrics
    • static_configs statically configured list of target addresses
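The fallback from job-level settings to global settings can be sketched as follows. The configuration is written as a Python dict that mirrors the structure of the YAML file; the job names, targets, and the helper function are assumptions for illustration, not Prometheus code.

```python
# Built-in defaults as listed above.
BUILTIN_DEFAULTS = {"scrape_interval": "1m", "scrape_timeout": "10s"}
DEFAULT_METRICS_PATH = "/metrics"


def effective_job_settings(global_config, job):
    """Resolve a job's settings: job value, then global value, then default."""
    resolved = {"job_name": job["job_name"]}
    for key, default in BUILTIN_DEFAULTS.items():
        resolved[key] = job.get(key, global_config.get(key, default))
    resolved["metrics_path"] = job.get("metrics_path", DEFAULT_METRICS_PATH)
    resolved["targets"] = [
        target
        for static_config in job.get("static_configs", [])
        for target in static_config.get("targets", [])
    ]
    return resolved


# Hypothetical configuration with two jobs.
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {"job_name": "node", "static_configs": [{"targets": ["localhost:9100"]}]},
        {
            "job_name": "app",
            "scrape_interval": "5s",
            "metrics_path": "/actuator/prometheus",
            "static_configs": [{"targets": ["localhost:8080"]}],
        },
    ],
}

jobs = [effective_job_settings(config["global"], job) for job in config["scrape_configs"]]
```

Here the "node" job inherits the 15s interval from global, while the "app" job overrides both the interval and the metrics path.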

2.3 Query Language PromQL

Prometheus provides a functional query language called PromQL (Prometheus Query Language), which allows users to select and aggregate time series data in real time. Document address: https://prometheus.io/docs/prometheus/latest/querying/basics/

  • filter query

    Use {} to filter the results by label. Inside the braces, = selects labels equal to the value, != selects labels not equal to it, =~ selects labels that match a regular expression, and !~ selects labels that do not match it

    http_requests_total{method="GET"}
    http_requests_total{environment=~"staging|testing|development",method!="GET"}
    http_requests_total{status!~"4.."}
    
  • range query

    Append a duration in square brackets to select a range of samples, for example the samples from the last 5 minutes

    http_requests_total[5m]
    
  • Offset time query

    offset shifts the evaluation time of an individual instant vector or range vector in a query back in time. For example, the value of http_requests_total as it was 5 minutes ago

    http_requests_total offset 5m
    
  • Fixed time query

    @ sets the evaluation time of an individual instant vector or range vector in a query. The time given to the @ modifier is a Unix timestamp, written as a floating point number

    For example: return the value at 2021-01-04T07:40:00+00:00

    http_requests_total @ 1609746000
    
  • aggregation query

    Prometheus provides aggregation operators such as sum, max, min, avg, count, bottomk, and topk

    sum(http_requests_total)
    sum by (application, group) (http_requests_total)
    topk(5, http_requests_total)
    
  • function query

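    Prometheus provides functions that operate on query data, document: https://prometheus.io/docs/prometheus/latest/querying/functions/ . The example below is a subquery: it returns the 5-minute rate of http_requests_total for the past 30 minutes, at a resolution of 1 minute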

    rate(http_requests_total[5m])[30m:1m]
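The four label matchers from the filter query item can be sketched in a few lines. This is an illustration only: the `matches` helper and the sample series are made up for the example, and Python's `re` module stands in for the regular expression engine Prometheus uses, whose syntax differs in places. Regex matchers are fully anchored, which `re.fullmatch` reproduces.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical label sets for three http_requests_total series.
series = [
    {"method": "GET", "status": "200", "environment": "staging"},
    {"method": "POST", "status": "404", "environment": "testing"},
    {"method": "POST", "status": "201", "environment": "production"},
]

# http_requests_total{environment=~"staging|testing|development",method!="GET"}
selected = [
    s for s in series
    if matches(s, [("environment", "=~", "staging|testing|development"),
                   ("method", "!=", "GET")])
]

# http_requests_total{status!~"4.."}
not_4xx = [s for s in series if matches(s, [("status", "!~", "4..")])]
```

Because of the anchoring, a pattern such as "4.." matches "404" but not "4040"; a prefix match needs a trailing ".*".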
    

2.4 Exporter

Prometheus obtains data through exporters, which can be downloaded and installed as needed from the links in the documentation: https://prometheus.io/docs/instrumenting/exporters/

3. Grafana

3.1 Data source

grafana_datasource_01.png

3.2 Permissions

Grafana provides a permission system that grants users different capabilities, such as viewing or editing panels, according to their role.

There are three roles: Admin, Editor, and Viewer

Users are added by invitation, with an invite link sent to the user. Users can also be placed in groups to control who may view which panels

3.3 Panel visualization

Related documentation: https://grafana.com/docs/grafana/latest/panels-visualizations/

  • panel query expression

    grafana_03_create-panel_query.png

  • Panel type

    The most common type is Graph. More panel types can be downloaded from the official website and imported; make sure they are compatible with your Grafana version

    grafana_04_panel_type.png

  • panel parameters

    If the parameter of the y-axis is a percentage, it can be controlled as follows

    grafana_05_panel_percent.png

3.4 Dashboard

A dashboard is a collection of panels like the ones above

  • import

    Besides building your own panels, you can use dashboards made by others from https://grafana.com/grafana/dashboards/ and load them through the Import menu

  • View

    • Kiosk mode: append the kiosk parameter to the dashboard URL (&kiosk) to hide the sidebar and top menu

    • Anonymous access

      Modify the configuration file conf/defaults.ini

      [auth.anonymous]
      # Set to true to allow anonymous access: the URL can be opened without logging in
      enabled = true
      
    • Allow embedding

      Modify the configuration file conf/defaults.ini

      [security]
      # Set to true to allow dashboards to be embedded in other pages
      allow_embedding = true
      
  • variable

    Variables add a drop-down list to the dashboard that selects which part of the data to display. Documentation: https://grafana.com/docs/grafana/latest/dashboards/variables/
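The viewing options above can be combined in a single link. The sketch below builds such a URL; the base URL, dashboard uid, slug, and variable name are hypothetical, and it assumes the usual conventions of a /d/<uid>/<slug> dashboard path and var-<name> query parameters for variables.

```python
from urllib.parse import urlencode


def dashboard_url(base_url, uid, slug, variables=None, kiosk=False):
    """Build a dashboard link with variable values and optional kiosk mode."""
    params = [(f"var-{name}", value) for name, value in (variables or {}).items()]
    query = urlencode(params)
    if kiosk:
        # The bare "kiosk" flag hides the sidebar and top menu.
        query = f"{query}&kiosk" if query else "kiosk"
    url = f"{base_url.rstrip('/')}/d/{uid}/{slug}"
    return f"{url}?{query}" if query else url


link = dashboard_url(
    "http://localhost:3000",
    "abc123",
    "windows-overview",
    variables={"instance": "host-1:9182"},
    kiosk=True,
)
```

With anonymous access and embedding enabled, a link like this can be placed in an iframe on another page.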

4. In practice

4.1 Monitor Windows/Linux

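Windows: download windows_exporter from https://github.com/prometheus-community/windows_exporter/releases

Linux: download node_exporter from https://github.com/prometheus/node_exporter/releases

Taking Windows as an example, the collectors to enable can be passed on the command line or set in a configuration file: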

windows_exporter.exe --collectors.enabled "[defaults],process,container"
windows_exporter.exe --config.file config.yml

grafana_02_windows.png

| Monitoring item | Expression |
| --- | --- |
| CPU usage | 100 - (avg by (instance,region) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) |
| Memory usage | 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 |
| Total disk usage | (sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (instance) - sum(windows_logical_disk_free_bytes{volume!~"Harddisk.*"}) by (instance)) / sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (instance) * 100 |
| Individual disk usage | 100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) |
| Bandwidth | (sum(irate(windows_net_bytes_total[1m])) > 1) * 8 |
| System threads | windows_system_threads |
| System processes | windows_os_processes |
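The arithmetic behind the CPU usage expression can be worked through by hand. In this sketch `irate` is a simplified stand-in that uses only the last two samples of a counter, and the sample values are invented; it is meant to show why subtracting the idle rate from 100 yields a usage percentage, not to reproduce Prometheus exactly.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) counter samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = v2 - v1 if v2 >= v1 else v2  # a drop means the counter was reset
    return delta / (t2 - t1)


def cpu_usage_percent(idle_samples_per_core):
    """100 - (average idle rate across cores * 100)."""
    rates = [irate(samples) for samples in idle_samples_per_core]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle-time counters (seconds) for two cores, sampled 15s apart.
idle_seconds = [
    [(0, 1000.0), (15, 1012.0)],  # core 0 was idle for 12 of 15 seconds
    [(0, 2000.0), (15, 2009.0)],  # core 1 was idle for 9 of 15 seconds
]
usage = cpu_usage_percent(idle_seconds)
```

The idle counter grows by at most one second per second per core, so its rate is the idle fraction; averaging 0.8 and 0.6 gives 0.7 idle, or roughly 30% CPU usage.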

4.2 Monitoring JVMs

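Download Exporter: https://github.com/prometheus/jmx_exporter/releases

Start the application with the exporter attached as a Java agent (12345 is the port the metrics are exposed on, config.yml is the exporter configuration file):

java -javaagent:jmx_prometheus_javaagent-0.18.0.jar=12345:config.yml -jar vhr-web-0.0.1-SNAPSHOT.jar

config.yml:

rules:
- pattern: ".*"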

grafana_06_panel_jvm.png

| Monitoring item | Expression |
| --- | --- |
| JVM heap memory usage | jvm_memory_bytes_used{area="heap"} |
| Eden space usage | jvm_memory_pool_bytes_used{pool="PS Eden Space"} |
| Old generation usage | jvm_memory_pool_bytes_used{pool="PS Old Gen"} |
| Metaspace usage | jvm_memory_pool_bytes_used{pool="Metaspace"} |
| GC time | increase(jvm_gc_collection_seconds_sum[$__interval]) |
| GC count | increase(jvm_gc_collection_seconds_count[$__interval]) |

4.3 Monitoring MySQL

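Download Exporter: https://github.com/prometheus/mysqld_exporter/releases

Start the exporter and point it at a MySQL client configuration file:

mysqld_exporter.exe --config.my-cnf config.cnf --web.listen-address=localhost:9104

config.cnf:

[client]
user=root
password=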

grafana_08_mysql_01.png

| Monitoring item | Expression |
| --- | --- |
| Connections | sum(max_over_time(mysql_global_status_threads_connected[$__interval])) |
| Slow queries | sum(rate(mysql_global_status_slow_queries[$__interval])) |
| Average number of running threads | sum(avg_over_time(mysql_global_status_threads_running[$__interval])) |
| Current QPS | rate(mysql_global_status_queries[$__interval]) |
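The max_over_time and avg_over_time functions used in these expressions reduce all samples inside a time window to one value. A simplified sketch, with invented sample data and helper functions that only borrow the PromQL names (the window boundary handling here is an assumption and may differ from Prometheus):

```python
def window(samples, now, range_seconds):
    """Values of (timestamp, value) samples with now - range < timestamp <= now."""
    return [v for t, v in samples if now - range_seconds < t <= now]


def max_over_time(samples, now, range_seconds):
    return max(window(samples, now, range_seconds))


def avg_over_time(samples, now, range_seconds):
    values = window(samples, now, range_seconds)
    return sum(values) / len(values)


# Hypothetical gauge samples: connected threads scraped every 15 seconds.
threads_connected = [(0, 4), (15, 9), (30, 6), (45, 5), (60, 7)]

peak = max_over_time(threads_connected, now=60, range_seconds=45)
mean = avg_over_time(threads_connected, now=60, range_seconds=45)
```

Using the peak rather than the latest value keeps a short connection spike visible on a panel whose interval is wider than the scrape interval.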

4.4 Monitoring Springboot API

A Spring Boot project sometimes needs to count how often an API endpoint is called and how long each call takes. Actuator plus Micrometer covers both with two built-in annotations, @Timed and @Counted. Because the annotations are implemented with AOP, the AOP starter has to be added as a dependency as well.

Documentation: https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.enabling

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

Application configuration:

management:
  metrics:
    tags:
      application: ${spring.application.name}
    web:
      server:
        max-uri-tags: 200
  endpoints:
    web:
      exposure:
        include: prometheus

spring:
  application:
    name: prometheus-test-api
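
// Declare these beans in a @Configuration class. TimedAspect and CountedAspect
// come from io.micrometer.core.aop.
@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
}

// @Counted is only processed when a CountedAspect bean is registered as well.
@Bean
public CountedAspect countedAspect(MeterRegistry registry) {
    return new CountedAspect(registry);
}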

@GetMapping("/test")
@Timed(value = "test_method", description = "Test endpoint latency")
@Counted(value = "test_method", description = "Test endpoint call count")
public String test() {
    // Uncomment to simulate a slow endpoint:
    // try {
    //     Thread.sleep(1000);
    // } catch (InterruptedException e) {
    //     throw new RuntimeException(e);
    // }
    return "ok";
}
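Conceptually, the two annotations wrap the method in an aspect that records a call count and the elapsed time. The idea can be sketched with a Python decorator; the `Registry` class and `timed_and_counted` are hypothetical stand-ins for this illustration and do not mirror Micrometer's actual implementation.

```python
import functools
import time


class Registry:
    """Toy stand-in for a meter registry: call counts and total seconds per name."""

    def __init__(self):
        self.counts = {}
        self.total_seconds = {}


def timed_and_counted(registry, name):
    """Decorator that records how often and for how long the wrapped function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are still counted.
                elapsed = time.perf_counter() - start
                registry.counts[name] = registry.counts.get(name, 0) + 1
                registry.total_seconds[name] = registry.total_seconds.get(name, 0.0) + elapsed
        return wrapper
    return decorator


registry = Registry()


@timed_and_counted(registry, "test_method")
def handle_request():
    return "ok"


results = [handle_request() for _ in range(3)]
```

Dividing the total seconds by the count gives the average latency, which is the same calculation a dashboard performs on the exported sum and count series.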

grafana_09_springboot.png

Origin blog.csdn.net/qq_23091073/article/details/130884263