Foreword
Prometheus is an open source system monitoring and alerting toolkit written in Go. It was originally built at SoundCloud in 2012 and has since been adopted by many large companies and organizations. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and graduated in 2018. It is now a standalone open source project, maintained independently of any company.
More precisely, Prometheus is a complete monitoring solution: it covers the collection, storage, processing, visualization, and alerting of monitoring data.
- Official website: https://prometheus.io/
- GitHub address: https://github.com/prometheus/prometheus
Grafana is an open source, cross-platform metrics analytics and visualization tool that supports multiple data sources, such as Prometheus, Elasticsearch, and InfluxDB. Its rich set of charts and panels helps users understand and analyze monitoring data.
- Document address: https://grafana.com/docs/grafana/latest/
- GitHub address: https://github.com/grafana/grafana
Prometheus ships with its own web UI for charting data, but it is fairly basic. Grafana produces much more polished charts and supports Prometheus as a data source out of the box, so the classic monitoring stack is Prometheus + Grafana.
1. Concept
1.1 Development
Operations monitoring dates back to the early days of computing and has evolved along with the technology.
- Early O&M monitoring relied on manually checking system logs and performance metrics, which was time-consuming and error-prone.
- Later, automated monitoring tools such as Nagios and Zabbix appeared. They collect status and performance metrics through agents and protocols such as SNMP, then analyze and process them so that users can better understand how their systems are running.
- In recent years, cloud computing and container technology have pushed monitoring further. Prometheus, for example, is an open source monitoring system built for cloud native environments that helps users manage and monitor cloud native applications.
In short, operations monitoring has moved from manual inspection, through SNMP-era monitoring tools, to today's automated monitoring systems, giving users progressively better service and support.
1.2 Time series data
Time series data is data recorded and indexed in time order. Devices in fields such as the Internet of Things, the Internet of Vehicles, and the Industrial Internet generate massive amounts of it, and by some estimates it will account for more than 90% of the world's total data. On a monitoring platform, time series data usually means timestamped sequences such as system performance metrics and log information.
Compared with traditional relational data, time series data is dominated by the C and R of CRUD: samples are created and read, but not updated.
1.3 Metric
Metric is a very important concept that comes up constantly in operations monitoring. It refers to an indicator in the monitoring system, such as CPU usage, memory usage, or network traffic. In Prometheus, a metric is essentially a record stored in the time series database.
The Prometheus client libraries offer four metric types:
- Counter: A cumulative metric that represents a monotonically increasing counter whose value can only increase or be reset to zero on a restart. For example, you can use a counter for the number of requests served, tasks completed, or errors.
- Gauge: A single value that can go up and down arbitrarily. Gauges are typically used for measured values, such as temperature or current memory usage, but also for "counts" that can rise and fall, such as the number of concurrent requests.
- Histogram: Samples observations over a period of time and counts them in configurable buckets, showing how the samples are distributed. For example, for an interface's response time it records how many requests completed within 10 ms, within 20 ms, within 30 ms, and so on. Prometheus buckets are cumulative, so the number of requests between 10 ms and 20 ms is the difference between two adjacent buckets.
- Summary: Similar to a Histogram, but computes percentiles (quantiles) from the samples on the client side. For example, the TP99 or TP95 latency of a call chain.
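To make the difference between the types concrete, here is a minimal plain-Java sketch of Counter, Gauge, and Histogram semantics. The class names are hypothetical teaching aids, not the Prometheus client API, and Summary is left out for brevity. The bucket bounds (10, 20, 30 ms) are taken from the latency example above.

```java
import java.util.Arrays;

// Plain-Java illustration of metric type semantics.
// Hypothetical classes for teaching only; this is NOT the Prometheus client library.
class MetricTypesSketch {

    // Counter: may only go up; a process restart starts it again at zero.
    static final class Counter {
        private double value;
        void inc(double amount) {
            if (amount < 0) throw new IllegalArgumentException("a counter cannot decrease");
            value += amount;
        }
        double get() { return value; }
    }

    // Gauge: a single value that may rise or fall freely.
    static final class Gauge {
        private double value;
        void set(double v) { value = v; }
        double get() { return value; }
    }

    // Histogram: cumulative buckets (each counts observations <= its upper bound),
    // plus a running sum and a total count.
    static final class Histogram {
        private final double[] upperBounds;
        private final long[] bucketCounts;
        private double sum;
        private long count;

        Histogram(double... upperBounds) {
            this.upperBounds = upperBounds.clone();
            Arrays.sort(this.upperBounds);
            this.bucketCounts = new long[this.upperBounds.length];
        }

        void observe(double v) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (v <= upperBounds[i]) bucketCounts[i]++;
            }
            sum += v;
            count++;
        }

        long bucket(int i) { return bucketCounts[i]; }
        long count() { return count; }
        double sum() { return sum; }
    }

    public static void main(String[] args) {
        Histogram latencyMs = new Histogram(10, 20, 30);
        for (double v : new double[] {5, 12, 18, 25, 40}) latencyMs.observe(v);
        // Requests between 10 ms and 20 ms = difference of two adjacent cumulative buckets.
        System.out.println("10-20 ms: " + (latencyMs.bucket(1) - latencyMs.bucket(0)));
        System.out.println("total: " + latencyMs.count());
    }
}
```

The cumulative layout is why per-range counts are derived by subtraction rather than stored directly.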
2. Prometheus
2.1 Architecture
- Prometheus Server: Uses service discovery to find the targets to monitor and pulls metric data from them. It stores the time series data, evaluates the defined rules to precompute metrics, and sends triggered alerts to the Alertmanager component for handling.
- Pushgateway: Target hosts (typically short-lived jobs that cannot be scraped directly) push their data to the Pushgateway, and the Prometheus server then pulls it from there.
- Exporters: Collect monitoring metrics from existing third-party services and expose them over HTTP. Prometheus supports a wide variety of exporters, whose metrics the Prometheus server then scrapes.
- Alertmanager: Sends notifications through the channel configured for each alert. After receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the matching receiver, and sends the notification. Common receivers include email, WeChat, DingTalk, and Slack.
- Grafana: The data visualization component and monitoring dashboard. It queries data from the Prometheus Server through PromQL and displays it.
- Prometheus web UI: A simple web console, served on port 9090 by default.
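The pull model is easier to picture with a toy exporter. The sketch below uses only the JDK's built-in HTTP server to expose one counter in the Prometheus text exposition format at /metrics. The metric name demo_requests_total, the class name, and port 8000 are assumptions made up for this illustration. Real exporters normally use a Prometheus client library instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Toy exporter: Prometheus would scrape http://<host>:<port>/metrics on every scrape interval.
class MiniExporter {
    static final AtomicLong REQUESTS = new AtomicLong();

    // Renders a single counter in the Prometheus text exposition format.
    static String render(long requests) {
        return "# HELP demo_requests_total Requests served by this demo exporter.\n"
             + "# TYPE demo_requests_total counter\n"
             + "demo_requests_total " + requests + "\n";
    }

    // Starts an HTTP server that answers /metrics with the current counter value.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render(REQUESTS.incrementAndGet()).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8000); // hypothetical port; any free port works
    }
}
```

Nothing is pushed anywhere: the exporter only answers when the Prometheus server asks, which is what distinguishes it from the Pushgateway path.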
2.2 Configuration
Prometheus can load configuration files through the --config.file command option.
When --web.enable-lifecycle is enabled, sending a POST request to the /-/reload URL reloads the configuration file without restarting Prometheus.
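As a rough sketch, the reload request can be sent from Java with the JDK 11+ HTTP client. The base URL http://localhost:9090 is an assumption (a local Prometheus on the default port), and the class and method names are made up for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ReloadConfig {
    // Builds the POST request for the lifecycle reload endpoint.
    static HttpRequest reloadRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed base URL: Prometheus running locally with --web.enable-lifecycle.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(reloadRequest("http://localhost:9090"), HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```

A GET will not work here; the endpoint only reacts to POST (or PUT) requests.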
Configuration documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/ . The four most commonly used configuration sections are described below.
- global
  Global settings, such as the scrape interval, the scrape timeout, and the rule evaluation interval.
  - scrape_interval: default interval between scrapes of a target; defaults to 1m
  - scrape_timeout: timeout for a single scrape; defaults to 10s
  - evaluation_interval: interval at which rules are evaluated; defaults to 1m
- rule_files
  Lists the rule files to load. Rules come in two kinds: recording rules and alerting rules.
  - Recording rules
    Recording rules allow expressions that are often required or expensive to compute to be precomputed and their results saved as a new set of time series. Querying the precomputed results is often much faster than executing the original expression each time it is needed. This is especially useful for dashboards that need to query the same expression repeatedly on each refresh.
    Documentation: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
  - Alerting rules
    Define alert conditions based on PromQL and send notifications about triggered alerts to external services.
    Documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
- alerting
  Configures the Alertmanager instances that Prometheus sends alerts to.
- scrape_configs
  Configures the scrape jobs and their targets. Documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
  - job_name: job name
  - scrape_interval: scrape frequency; defaults to global.scrape_interval
  - scrape_timeout: scrape timeout; defaults to global.scrape_timeout
  - metrics_path: HTTP path to scrape; defaults to /metrics
  - static_configs: statically configured target addresses (host:port)
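Putting the four sections together, a minimal prometheus.yml might look like the sketch below. The intervals, the rules/*.yml path, and the target addresses are assumptions chosen for illustration (9090 and 9093 are the default ports of Prometheus and Alertmanager), not values from a real deployment.

```yaml
# prometheus.yml (illustrative sketch; paths and targets are assumptions)
global:
  scrape_interval: 15s        # overrides the 1m default
  scrape_timeout: 10s
  evaluation_interval: 15s    # how often rules are evaluated

rule_files:
  - "rules/*.yml"             # hypothetical location of recording and alerting rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "prometheus"    # Prometheus scraping its own metrics
    metrics_path: /metrics    # the default, shown for clarity
    static_configs:
      - targets: ["localhost:9090"]
```

Prometheus would then be started with something like `prometheus --config.file=prometheus.yml --web.enable-lifecycle`.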
2.3 Query Language PromQL
Prometheus provides a functional query language called PromQL (Prometheus Query Language), which allows users to select and aggregate time series data in real time. Document address: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Filter query
  Use {} to filter results by label. Inside the braces, = selects label values that are equal, != values that are not equal, =~ values that match a regular expression, and !~ values that do not match one.
  http_requests_total{method="GET"}
  http_requests_total{environment=~"staging|testing|development",method!="GET"}
  http_requests_total{status!~"4.."}
- Range query
  Append [duration] to select a range of samples, for example the samples from the last 5 minutes:
  http_requests_total[5m]
- Offset query
  offset shifts the evaluation time of an individual instant vector or range vector in a query. For example, the value of http_requests_total 5 minutes ago:
  http_requests_total offset 5m
- Fixed-time query
  @ sets the evaluation time of an individual instant vector or range vector in a query. The time given to the @ modifier is a Unix timestamp, written as a floating point number.
  For example, to return the value at 2021-01-04T07:40:00+00:00:
  http_requests_total @ 1609746000
- Aggregation query
  Prometheus provides aggregation operators such as sum, max, min, avg, count, bottomk, and topk.
  sum(http_requests_total)
  sum by (application, group) (http_requests_total)
  topk(5, http_requests_total)
- Function query
  Prometheus provides functions for computing over query data. Documentation: https://prometheus.io/docs/prometheus/latest/querying/functions/
  For example, a subquery that returns the 5-minute rate of http_requests_total over the past 30 minutes at a 1-minute resolution:
  rate(http_requests_total[5m])[30m:1m]
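Queries like the ones above can also be sent programmatically through the Prometheus HTTP API (/api/v1/query). The following sketch computes the Unix timestamp used in the @ example and URL-encodes a query. The base URL http://localhost:9090 and the class and method names are assumptions for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

class PromQlQueryUrl {
    // Unix timestamp in seconds for an ISO-8601 instant, as expected by the @ modifier.
    static long unixSeconds(String isoInstant) {
        return Instant.parse(isoInstant).getEpochSecond();
    }

    // Builds an instant-query URL for the Prometheus HTTP API.
    static String instantQueryUrl(String baseUrl, String promQl) {
        return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promQl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        long ts = unixSeconds("2021-01-04T07:40:00Z");
        System.out.println(ts); // 1609746000
        // Assumed base URL: a local Prometheus on the default port.
        System.out.println(instantQueryUrl("http://localhost:9090", "http_requests_total @ " + ts));
    }
}
```

Encoding matters because PromQL is full of characters such as {, ", and [ that are not valid in a raw URL.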
2.4 Exporter
Prometheus collects data from exporters. Download and install the ones you need from the links in the documentation: https://prometheus.io/docs/instrumenting/exporters/
3. Grafana
3.1 Data source
3.2 Permissions
Grafana provides a permission system in which each user's role determines what they can do, such as viewing or editing panels.
There are three roles: Admin, Editor, and Viewer.
Users can be added by invitation, with the invite link sent to them, and teams can be used to control who is allowed to view a panel.
3.3 Panel visualization
Related documentation: https://grafana.com/docs/grafana/latest/panels-visualizations/
- Panel query expression
- Panel type
  The most common one is Graph. More panel types can be downloaded from the official website and imported; pay attention to version compatibility.
- Panel parameters
  If the y-axis value is a percentage, set the panel's unit to a percent format.
3.4 Dashboard
A dashboard combines multiple panels like the ones above.
- Import
  In addition to building your own dashboards, you can use dashboards made by others from https://grafana.com/grafana/dashboards/ and load them through the Import menu.
- View
  - Appending the kiosk parameter to the dashboard URL (&kiosk) hides the sidebar and top menu.
  - Anonymous access
    Modify the configuration file conf/defaults.ini
    [auth.anonymous]
    # Set to true to allow anonymous access; the URL can be opened without logging in
    enabled = true
  - Allow embedding
    Modify the configuration file conf/defaults.ini
    [security]
    # Set to true to allow dashboards to be embedded in other pages
    allow_embedding = true
- Variables
  Variables let viewers choose what to display from a drop-down list. Documentation: https://grafana.com/docs/grafana/latest/dashboards/variables/
4. Hands-on practice
4.1 Monitoring Windows/Linux
Windows: Download the exporter from https://github.com/prometheus-community/windows_exporter/releases
Linux: Download the exporter from https://github.com/prometheus/node_exporter/releases
Let's take Windows as an example. Start the exporter either with the collectors given on the command line or with a configuration file:
windows_exporter.exe --collectors.enabled "[defaults],process,container"
windows_exporter.exe --config.file config.yml
Monitoring items
Monitoring metric | Expression |
---|---|
CPU usage | 100 - (avg by (instance,region) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) |
Memory usage | 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 |
Total disk usage | (sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (instance) - sum(windows_logical_disk_free_bytes{volume!~"Harddisk.*"}) by (instance)) / sum(windows_logical_disk_size_bytes{volume!~"Harddisk.*"}) by (instance) * 100 |
Individual disk usage | 100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) |
Bandwidth | (sum(irate(windows_net_bytes_total[1m])) > 1) * 8 |
System threads | windows_system_threads |
System processes | windows_os_processes |
4.2 Monitoring JVMs
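Download Exporter: https://github.com/prometheus/jmx_exporter/releases
Start the application with the agent attached. Metrics are exposed on port 12345 and the agent reads its rules from config.yml:
java -javaagent:jmx_prometheus_javaagent-0.18.0.jar=12345:config.yml -jar vhr-web-0.0.1-SNAPSHOT.jar
config.yml:
rules:
- pattern: ".*"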
Monitoring metric | Expression |
---|---|
JVM heap memory usage | jvm_memory_bytes_used{area="heap"} |
Eden space usage | jvm_memory_pool_bytes_used{pool="PS Eden Space"} |
Old generation usage | jvm_memory_pool_bytes_used{pool="PS Old Gen"} |
Metaspace usage | jvm_memory_pool_bytes_used{pool="Metaspace"} |
GC time | increase(jvm_gc_collection_seconds_sum[$__interval]) |
GC count increase | increase(jvm_gc_collection_seconds_count[$__interval]) |
4.3 Monitoring MySQL
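Download Exporter: https://github.com/prometheus/mysqld_exporter/releases
Start the exporter with the connection settings stored in config.cnf:
mysqld_exporter.exe --config.my-cnf config.cnf --web.listen-address=localhost:9104
config.cnf:
[client]
user=root
password=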
Monitoring metric | Expression |
---|---|
Connections | sum(max_over_time(mysql_global_status_threads_connected[$__interval])) |
Number of slow queries | sum(rate(mysql_global_status_slow_queries[$__interval])) |
Average number of running threads | sum(avg_over_time(mysql_global_status_threads_running[$__interval])) |
Current QPS | rate(mysql_global_status_queries[$__interval]) |
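The exporters from sections 4.1 to 4.3 only produce data once Prometheus is told to scrape them. A sketch of the matching scrape_configs entries follows. It assumes everything runs on localhost, and the job names are made up. Port 9182 is the windows_exporter default, while 12345 and 9104 come from the start commands shown above.

```yaml
# Fragment of prometheus.yml (illustrative; hosts and job names are assumptions)
scrape_configs:
  - job_name: "windows"
    static_configs:
      - targets: ["localhost:9182"]    # windows_exporter default port
  - job_name: "jvm"
    static_configs:
      - targets: ["localhost:12345"]   # port passed to the JMX agent in 4.2
  - job_name: "mysql"
    static_configs:
      - targets: ["localhost:9104"]    # --web.listen-address from 4.3
```

All three exporters serve their metrics on the default /metrics path, so metrics_path does not need to be set.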
4.4 Monitoring Spring Boot API
In a Spring Boot project you sometimes need to count how often an API endpoint is called and how long each call takes. Actuator plus Micrometer covers both with two built-in annotations, @Timed and @Counted. Because the annotations are implemented with AOP, the AOP starter must be added as well.
Documentation: https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.enabling
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
management:
  metrics:
    tags:
      application: ${spring.application.name}
    web:
      server:
        max-uri-tags: 200
  endpoints:
    web:
      exposure:
        include: prometheus
spring:
  application:
    name: prometheus-test-api
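// Configuration class: register the aspects that process the annotations
@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
    return new TimedAspect(registry);
}

// @Counted needs its own aspect (io.micrometer.core.aop.CountedAspect), just as @Timed needs TimedAspect
@Bean
public CountedAspect countedAspect(MeterRegistry registry) {
    return new CountedAspect(registry);
}

// Controller
@GetMapping("/test")
@Timed(value = "test_method", description = "Time taken by the test endpoint")
@Counted(value = "test_method", description = "Number of calls to the test endpoint")
public String test() {
    //try {
    //    Thread.sleep(1000);
    //} catch (InterruptedException e) {
    //    throw new RuntimeException(e);
    //}
    return "ok";
}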
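With the configuration above, Actuator serves the metrics at /actuator/prometheus rather than the default /metrics, so the scrape job has to set metrics_path. A sketch, assuming the application runs on localhost:8080 (the Spring Boot default port) and using a made-up job name:

```yaml
# Fragment of prometheus.yml (illustrative; host, port, and job name are assumptions)
scrape_configs:
  - job_name: "spring-boot-api"
    metrics_path: /actuator/prometheus   # Actuator's Prometheus endpoint
    static_configs:
      - targets: ["localhost:8080"]
```

Micrometer's Prometheus naming conventions should then expose the timer as test_method_seconds_count, test_method_seconds_sum, and test_method_seconds_max, and the counter as test_method_total. Exact names can vary between Micrometer versions, so check the endpoint output. Under that assumption, the average call time over 5 minutes would be rate(test_method_seconds_sum[5m]) / rate(test_method_seconds_count[5m]).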