Cloud-native Docker container monitoring in detail (cAdvisor, Node Exporter, Prometheus)

1. Introduction

cAdvisor source code
Node Exporter source code

Prometheus:
official documentation
PromQL documentation
Prometheus source code
Alertmanager documentation
Alertmanager source code
Chinese documentation

2. cAdvisor

cAdvisor gives container users insight into the resource usage and performance characteristics of their running containers. It collects, aggregates, processes, and exports information about running containers. For each container it keeps resource isolation parameters, historical resource usage, histograms of complete historical resource usage, and network statistics.

In short, cAdvisor monitors containers in real time and collects performance data, including CPU, memory, network, and file system usage.

2.1. Install cAdvisor

  1. Download the binaries:
wget  https://github.com/google/cadvisor/releases/download/v0.46.0/cadvisor-v0.46.0-linux-amd64
  2. Write a Dockerfile to build the image.
# Use ubuntu as the base image
FROM ubuntu:latest
LABEL cadvisor="0.46.0"
# Copy the downloaded binary into the image
COPY ./cadvisor-v0.46.0-linux-amd64 /usr/bin/cadvisor
# Make it executable
RUN chmod +x /usr/bin/cadvisor
# Set the entry point. ENTRYPOINT is used instead of CMD because cadvisor takes many startup flags;
# with CMD too many arguments would have to be specified, which is less concise.
ENTRYPOINT ["/usr/bin/cadvisor"]
  3. Build the image.
docker build -t cadvisor:0.46.0 .
  4. Run the container.
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--userns=host \
--privileged \
--device=/dev/kmsg \
cadvisor:0.46.0
  5. Open the web UI to view the monitoring charts.
http://localhost:8080


2.2. Use Prometheus to monitor cAdvisor

cAdvisor exposes container and hardware statistics as Prometheus metrics out of the box. By default, these metrics are served under the /metrics path of the HTTP endpoint, for example http://192.168.0.106:8080/metrics. The path and the set of exposed metrics can be customized with the -prometheus_endpoint and the -disable_metrics or -enable_metrics command-line flags:

  1. -disable_metrics: A comma-separated list of metrics to disable. Options include:
    accelerator, advtcp, app, cpu, cpuLoad, cpu_topology, cpuset, disk, diskIO, hugetlb, memory, memory_numa, network, oom_event, percpu, perf_event, process, referenced_memory, resctrl, sched, tcp, udp. Default:
    advtcp, cpu_topology, cpuset, hugetlb, memory_numa, process, referenced_memory, resctrl, sched, tcp, udp.
  2. -enable_metrics: A comma-separated list of metrics to enable, overrides the -disable_metrics option if set. Options include:
    accelerator, advtcp, app, cpu, cpuLoad, cpu_topology, cpuset, disk, diskIO, hugetlb, memory, memory_numa, network, oom_event, percpu, perf_event, process, referenced_memory, resctrl, sched, tcp, udp.
  3. -prometheus_endpoint: Endpoint to expose Prometheus metrics (defaults to "/metrics").
  4. Example:
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--userns=host \
--privileged \
--device=/dev/kmsg \
cadvisor:0.46.0 -disable_metrics cpu,cpuLoad
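
The interaction between the two flags can be sketched in a few lines of Python. This is only a model of the behavior described above (an explicit -disable_metrics value replaces the default list, and -enable_metrics overrides -disable_metrics when set), not cAdvisor's own code; the function name effective_metrics is made up for illustration.

```python
# Metric groups and default-disabled groups, copied from the flag descriptions above.
ALL_METRICS = set(
    "accelerator,advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,"
    "memory,memory_numa,network,oom_event,percpu,perf_event,process,"
    "referenced_memory,resctrl,sched,tcp,udp".split(",")
)
DEFAULT_DISABLED = set(
    "advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,"
    "referenced_memory,resctrl,sched,tcp,udp".split(",")
)


def effective_metrics(disable=None, enable=None):
    """Return the metric groups that end up enabled (hypothetical helper)."""
    if enable is not None:
        # -enable_metrics overrides -disable_metrics when it is set.
        return set(enable.split(",")) & ALL_METRICS
    # An explicit -disable_metrics value replaces the default list.
    disabled = DEFAULT_DISABLED if disable is None else set(disable.split(","))
    return ALL_METRICS - disabled


enabled = effective_metrics(disable="cpu,cpuLoad")
print(sorted(enabled))
```

Under this model, the example command above turns off cpu and cpuLoad but, because it replaces the default list, turns on groups such as tcp and udp that are disabled by default.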

2.3. Prometheus metrics exposed by cAdvisor

Container metrics:

  1. Documentation.
  2. Metrics: [figure: list of container metrics]

Hardware metrics:

  1. Documentation.
  2. Metrics: [figure: list of hardware metrics]

3. Node Exporter

Node Exporter is an exporter officially provided by Prometheus, and the project is hosted under the prometheus account. It collects hardware and operating system metrics from hosts.

3.1. Install Node Exporter

  1. Start the container; the default port is 9100.
# Install Node Exporter to collect hardware information
docker run -d \
--net="host" \
--pid="host" \
--userns="host" \
-v "/:/host:ro,rslave" \
--name node_exporter \
quay.io/prometheus/node-exporter:latest \
--path.rootfs=/host
  2. Visit the HTTP endpoint to view the metrics:
http://192.168.0.106:9100/metrics

  3. --collector.<name> enables a collector, --no-collector.<name> disables a collector, and --collector.disable-defaults disables all collectors that are enabled by default. For example:
docker run -d \
--net="host" \
--pid="host" \
--userns="host" \
-v "/:/host:ro,rslave" \
--name node_exporter \
quay.io/prometheus/node-exporter:latest \
--path.rootfs=/host \
--collector.disable-defaults \
--collector.arp --collector.bcache
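
The /metrics endpoint returns plain text, one sample per line, with # HELP and # TYPE comment lines in between. The sketch below shows one way to read such output in Python. It is a simplified reader for illustration, not a full parser of the exposition format: it does not unescape label values and ignores optional timestamps. SAMPLE_TEXT is made-up data and parse_metrics is a hypothetical helper.

```python
import re

# name, optional {labels}, then the value; anything after the value is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Made-up sample in the shape Node Exporter returns.
SAMPLE_TEXT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified reader cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself; a reader like this is mainly useful for quick checks of what a collector exposes.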

3.2. Metrics

Collectors enabled by default:

  1. Documentation.
  2. Collectors: [figure: list of collectors enabled by default]

Collectors disabled by default:

  1. Documentation.
  2. Reasons a collector is disabled by default: it produces high-cardinality metrics; a scrape takes longer than the scrape_interval or scrape_timeout configured in Prometheus; or it consumes a lot of host resources. Be cautious when enabling these collectors, and enable them only as needed.
  3. Collectors: [figure: list of collectors disabled by default]

4. Prometheus

Prometheus is an open source monitoring and alerting system. It scrapes data from the collection endpoints at regular intervals and stores the results in its time series database. PromQL expressions compute over the stored series to analyze the state of the system, and alerts are triggered by periodically evaluating specified PromQL expressions.

4.1. Installation

(1) Configuration file (prometheus.yml). See the instructions on the official website for how to write the configuration.

global:
  # Scrape metrics every 20s
  scrape_interval: 20s
  # Scrape timeout: 10s
  scrape_timeout: 10s
  # Rule evaluation frequency, i.e. how often rules are evaluated to check whether any has triggered
  evaluation_interval: 20s
# Rule files: recording and alerting rules are read from all matching files
rule_files:
  - "alertRule.yml"
  - "recordRule.yml"
# Scrape configuration list
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 192.168.0.106:8080
  - job_name: 'node'
    static_configs:
      - targets:
          - 192.168.0.106:9100
          - 192.168.0.142:9100
          - 192.168.0.143:9100
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 192.168.0.106:9090
# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.0.106:9093']
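
When the target list grows, the scrape_configs section can be generated rather than typed by hand. The following is a rough sketch using only string formatting; render_scrape_configs is a hypothetical helper, and it deliberately indents with spaces because YAML does not allow tabs for indentation.

```python
def render_scrape_configs(jobs, indent="  "):
    """Render a scrape_configs section from {job_name: [targets]} (hypothetical helper)."""
    lines = ["scrape_configs:"]
    for job_name, targets in jobs.items():
        lines.append(f"{indent}- job_name: '{job_name}'")
        lines.append(f"{indent * 2}static_configs:")
        lines.append(f"{indent * 3}- targets:")
        for target in targets:
            # Quote each host:port so it is always read as a string.
            lines.append(f"{indent * 5}- '{target}'")
    return "\n".join(lines) + "\n"


print(render_scrape_configs({
    "cadvisor": ["192.168.0.106:8080"],
    "node": ["192.168.0.106:9100", "192.168.0.142:9100", "192.168.0.143:9100"],
}))
```

The generated text can be pasted into prometheus.yml and checked with promtool before Prometheus is restarted.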

(2) Start the container.

docker run -itd --name prometheus -p 9090:9090 \
-v /opt/prometheus:/etc/prometheus \
prom/prometheus --config.file=/etc/prometheus/prometheus.yml

(3) Access endpoint: http://192.168.0.106:9090
(4) Metric types:

  • Counter: a value that only increases and never decreases (apart from resetting to zero when the process restarts), used to describe the cumulative total of a metric. For example, total CPU time: node_cpu_seconds_total
  • Gauge: a value that can go up or down, used to describe the current state of a metric. For example, free memory: node_memory_MemFree_bytes
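
The difference matters when reading values: a gauge is read directly, while a counter is only meaningful as a change over time. The sketch below computes the increase of a counter from a list of raw samples and treats any drop as a reset to zero. It is a simplification for illustration; the increase() function in PromQL additionally extrapolates to the edges of the time window, which this sketch does not attempt.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down, so assume it restarted from zero.
            total += current
    return total


def gauge_current(samples):
    """A gauge is simply read as its most recent value."""
    return samples[-1]


print(counter_increase([100, 130, 160]))      # steady growth
print(counter_increase([100, 130, 10, 40]))   # one reset in the middle
print(gauge_current([512, 480, 530]))
```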

(5) CPU usage over 5 minutes: 100 - (increase in idle CPU time over 5 minutes / increase in total CPU time over 5 minutes) * 100, grouped by instance. Because node_cpu_seconds_total is a counter that keeps accumulating CPU time, the expression has to work on its increase over the window rather than on its raw value. The expression is as follows:

100 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) * 100
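
As a worked example of this formula, the sketch below applies it to two snapshots of per-mode CPU counters taken at the start and end of a window. The numbers are invented and cpu_usage_percent is a hypothetical helper; it mirrors the arithmetic of the expression, not how Prometheus evaluates it internally.

```python
def cpu_usage_percent(start, end):
    """CPU usage in percent from cumulative per-mode CPU seconds at window start and end."""
    idle_delta = end["idle"] - start["idle"]
    total_delta = sum(end[mode] - start[mode] for mode in end)
    if total_delta == 0:
        return 0.0  # no CPU time elapsed between the two snapshots
    return 100 - idle_delta / total_delta * 100


# Invented counter values (seconds) for one instance.
window_start = {"idle": 1000.0, "user": 200.0, "system": 100.0}
window_end = {"idle": 1225.0, "user": 250.0, "system": 125.0}

print(cpu_usage_percent(window_start, window_end))
```

Here idle time grew by 225 seconds out of 300 seconds of total CPU time, so 75% of the window was idle and usage is 25%.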

(6) Machine load average: node_load1 is the 1-minute load average, node_load5 the 5-minute load average, and node_load15 the 15-minute load average.

node_load1
node_load5
node_load15

(7) Memory usage: node_memory_MemTotal_bytes is total memory, node_memory_MemFree_bytes free memory, node_memory_Buffers_bytes the buffer cache, and node_memory_Cached_bytes the page cache. Formula: (total memory - (free memory + buffer cache + page cache)) / total memory * 100.

(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100
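
The same formula, written out as a small Python function with invented byte counts, makes it easy to check a dashboard value by hand. memory_usage_percent is a hypothetical helper, not part of any library.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Memory usage in percent: memory that is neither free nor used as buffer or page cache."""
    used = total - (free + buffers + cached)
    return used / total * 100


# Invented values in bytes: 8000 total, of which 4000 are free or cache.
print(memory_usage_percent(total=8000, free=2000, buffers=500, cached=1500))
```

Buffers and page cache are subtracted along with free memory because the kernel can reclaim them when applications need more memory.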

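(8) Disk space usage: node_filesystem_avail_bytes is the number of available bytes and node_filesystem_size_bytes the total number of bytes. Their ratio is the share of space still available, so usage is 100 minus that percentage.

100 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100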

4.2. Rule configuration

(1) Rule check:

promtool check rules /path/to/example.rules.yml

(2) Recording rules:

groups:
  - name: RecordCpu
    rules:
      - record: Cpu15mRate
        expr: 100 - sum(increase(node_cpu_seconds_total{mode="idle"}[15m])) by (instance) / sum(increase(node_cpu_seconds_total[15m])) by (instance) * 100
        labels:
          CpuRate: "15"

(3) Alerting rules:

groups:
  # Group name
  - name: node_health
    # Rules
    rules:
      # Alert name
      - alert: InstanceDown
        # PromQL-based condition expression
        expr: up == 0
        # Pending duration: once the condition becomes true, wait this long before sending the alert
        for: 1m
        # Custom labels
        labels:
          NodeHealth: "false"
        # Additional information, such as a detailed description of the alert
        annotations:
          # Summary
          summary: "Instance {{ $labels.instance }} down"
          # Details
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"
  - name: node_resource
    rules:
      - alert: Cpu5mRate
        expr: 100 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) * 100 > 2
        labels:
          CpuRate: high
        annotations:
          # Summary
          summary: "Instance {{ $labels.instance }}: 5-minute CPU usage is too high"
          # Details
          description: "{{ $labels.instance }} of job {{ $labels.job }}: 5-minute CPU usage is too high"
      - alert: NodeLoad15
        expr: node_load15 > 0.8
        labels:
          NodeLoad15: high
        annotations:
          # Summary
          summary: "Instance {{ $labels.instance }}: 15-minute load average is too high, please check"
          # Details
          description: "{{ $labels.instance }} of job {{ $labels.job }}: 15-minute load average is too high"
      - alert: MemRate
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 20
        labels:
          NodeMemRate: high
        annotations:
          # Summary
          summary: "Machine memory usage is too high"
          # Details
          description: "Machine memory usage exceeds 20%, please check"
      - alert: DiskRate
        # Usage is 100 minus the percentage of space still available
        expr: 100 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        labels:
          DiskRate: high
        annotations:
          # Summary
          summary: "Machine disk usage is too high"
          # Details
          description: "Machine disk usage exceeds 80%, please check"
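
The effect of the for field can be illustrated with a small simulation: an alert whose condition has just become true is pending, and it only starts firing once the condition has stayed true for the whole for duration. The sketch below is a simplified model written for this article, not Prometheus code; alert_states is a hypothetical helper, and the 20s and 60s values correspond to the evaluation_interval and for settings used above.

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation, True when the expression matches.
    interval_s: seconds between evaluations (evaluation_interval).
    for_s: seconds the condition must hold before the alert fires (for).
    """
    states = []
    active_since = None
    for index, is_true in enumerate(condition_results):
        now = index * interval_s
        if not is_true:
            active_since = None  # the condition cleared, so the wait starts over
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# up == 0 is true from the second evaluation on, then the instance recovers.
print(alert_states([False, True, True, True, True, False], interval_s=20, for_s=60))
```

A short blip that clears before the for duration has elapsed therefore never produces a notification.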

(4) Specify the rule file in the configuration file:

rule_files:
  - "alertRule.yml"
  - "recordRule.yml"

4.3. Alertmanager

Alertmanager receives the alerts generated by Prometheus and manages the resulting notifications.
For example:

  • Deduplication: identical alerts triggered at the same time are collapsed into one.
  • Grouping: all alerts in the same group are combined into a single notification, to avoid receiving a flood of notifications at once.
  • Routing: routes can be configured so that different alerts notify operations staff in different roles.
  • Inhibition: once a given alert has fired, other alerts caused by it can be suppressed instead of being sent repeatedly.
  • Silencing: alerts that match a silence are muted and generate no notifications.
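
Deduplication and grouping both come down to comparing label sets. The sketch below shows the idea in Python: alerts with identical labels are treated as one alert, and the remaining alerts are bucketed by the values of a chosen list of labels. It only illustrates the concept and is not how Alertmanager is implemented; the function names and the sample alerts are invented.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Keep one alert per distinct label set."""
    unique = {}
    for alert in alerts:
        unique[frozenset(alert["labels"].items())] = alert
    return list(unique.values())


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"labels": {"alertname": "InstanceDown", "instance": "192.168.0.142:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "192.168.0.142:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "192.168.0.143:9100"}},
    {"labels": {"alertname": "MemRate", "instance": "192.168.0.106:9100"}},
]

for key, members in group_alerts(deduplicate(incoming), ["alertname"]).items():
    print(key, len(members))
```

With grouping by alertname, the two distinct InstanceDown alerts would go out as one notification and the MemRate alert as another.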

(1) Start Alertmanager.

docker run --name alertmanager -d -p 9093:9093 quay.io/prometheus/alertmanager

(2) Add the Alertmanager configuration to the Prometheus configuration file.

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.0.106:9093']

(3) Visit http://192.168.0.106:9093 to view the alerts.

5. Grafana

Grafana is an open source web UI for monitoring systems that supports multiple data sources. It supports custom dashboards as well as the ready-made dashboards published on the official website.

  1. Source code
  2. Official website
  3. Installation and use. Default user name: admin; default password: admin.
docker run -d -p 3000:3000 --name=grafana -v /var/lib/grafana grafana/grafana-enterprise
  4. Several existing Grafana dashboard templates (by ID):
1860, 9276, 193, 11600
  5. Go to the official website to choose a template.


Origin blog.csdn.net/Long_xu/article/details/129478370