Kubernetes + Docker monitoring
Docker's monitoring principle: Docker officially recommends against running multiple processes in one container, so running a monitoring agent (zabbix, etc.) inside the container is also discouraged. The agent should run on the host machine and obtain monitoring data through cgroups or the Docker API.
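As a hedged sketch of those two host-side data sources (the container ID and cgroup layout below are assumptions, not from this document; the cgroup path assumes cgroup v1 with the cgroupfs driver):

```shell
# Host-side data sources for container stats (container ID is hypothetical).
CONTAINER_ID="abc123def456"

# 1) cgroup pseudo-files on the host (cgroup v1, cgroupfs driver layout):
echo "/sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes"

# 2) Docker remote API stats endpoint, queried over the unix socket:
echo "http://localhost/containers/$CONTAINER_ID/stats"
# To actually fetch it, e.g.:
#   curl --unix-socket /var/run/docker.sock \
#     "http://localhost/containers/$CONTAINER_ID/stats?stream=false"
```

Either source gives per-container CPU, memory, and I/O figures without putting any process inside the container itself.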
1. Introduction to monitoring classification:
①, self-developed:
By calling Docker's API, data is obtained, processed, and displayed; this approach is not covered here.
E.g.:
1), iQIYI's dadvisor, modeled on cAdvisor, writes its data to Graphite, which is equivalent to cAdvisor + InfluxDB; iQIYI's dadvisor is not open source.
②、Docker —— cAdvisor:
Google's cAdvisor is another well-known open source container monitoring tool.
Just deploy the cAdvisor container on each host, and users can access very detailed performance data (CPU, memory, network, disk, file system, etc.) for that node and its containers through the web interface or REST service.
By default, cAdvisor caches data in memory, so its data display capabilities are limited; it also supports several persistent storage backends and can save and aggregate monitoring data to Google BigQuery, InfluxDB, or Redis.
In newer Kubernetes versions, the cAdvisor function has been integrated into the kubelet component.
Note that cAdvisor's web interface only shows the containers on a single host; for other machines you have to visit the URL of the corresponding IP. This works well with a few nodes but becomes tedious with many, so cAdvisor's data needs to be aggregated and displayed centrally; see [ cadvisor+influxdb+grafana ].
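Until such aggregation is in place, polling every node's cAdvisor REST endpoint from one host can at least be scripted; a minimal sketch (the node IPs are hypothetical; 4194 is cAdvisor's default port):

```shell
# Poll each node's cAdvisor REST API from one place (IPs are hypothetical).
NODES="192.168.16.101 192.168.16.102"
PORT=4194   # cAdvisor's default port
for node in $NODES; do
    echo "polling http://$node:$PORT/api/v1.3/machine"
    # curl -s "http://$node:$PORT/api/v1.3/machine"   # uncomment to fetch
done
```

This is still per-node pulling, not real aggregation, which is why the cadvisor+influxdb+grafana combination below is preferred at scale.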
③、Docker —— cAdvisor + InfluxDB + Grafana:
cAdvisor: collects container data and writes it to InfluxDB.
InfluxDB: time-series database providing data storage, persisted to a specified directory.
Grafana: provides a web console with custom query metrics; it queries data from InfluxDB and displays it.
This combination only monitors Docker, without Kubernetes.
④、Kubernetes —— Heapster + InfluxDB + Grafana:
Heapster: obtains metrics and event data from the k8s cluster and writes them to InfluxDB. Heapster collects data from more sources than a single cAdvisor (the whole cluster), but the data is aggregated, so less is stored in InfluxDB.
InfluxDB: time-series database providing data storage, persisted to a specified directory.
Grafana: provides a web console with custom query metrics; it queries data from InfluxDB and displays it.
2. Cadvisor+Heapster+InfluxDB+Grafana Notes:
①, Cadvisor notes:
For cAdvisor, you only need to enable it and configure the related options in the kubelet command;
it does not need to be started as a separate pod or command:
--cadvisor-port=4194 --storage-driver-db="cadvisor" --storage-driver-host="localhost:8086"
②, InfluxDB notes:
1), InfluxDB must be version 0.8.8 with this cAdvisor; otherwise, the cAdvisor log will show:
E0704 14:29:14.163238 05655 memory.go:94] failed to write stats to influxDb - Server returned (404): 404 page not found
http://blog.csdn.net/llqkk/article/details/50555442
That post says this cAdvisor version does not support InfluxDB 0.9, so 0.8.8 is used here. [ok]
Comparison table of cAdvisor and InfluxDB versions (tested ok):
Cadvisor version | Influxdb version
0.7.1 | 0.8.8
0.23.2 | 0.9.6 (and above)
[ If the cAdvisor and InfluxDB versions do not match, you will see the 404 error above in the cAdvisor log ]
2), The InfluxDB data needs to be cleaned up regularly; a single cAdvisor produces about 600 MB of data in half a day.
# Unit: [hour: h], [day: d]
# Delete data from within the last hour:
delete from /^stats.*/ where time > now() - 1h
# Delete data older than one hour:
delete from /^stats.*/ where time < now() - 1h
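To run such a cleanup from cron, the delete statement can be sent through InfluxDB 0.8's HTTP query endpoint. A sketch: the host and root/root credentials are taken from the curl examples in this document, while the 12-hour retention window is an arbitrary choice.

```shell
# Cron-able cleanup for InfluxDB 0.8 (host/credentials from the examples
# in this document; the 12 h retention window is an arbitrary choice).
INFLUX="http://192.168.16.100:8086"
DB="cadvisor"
QUERY="delete from /^stats.*/ where time < now() - 12h"
# URL-encode spaces as '+' for the q= parameter:
ENCODED=$(printf '%s' "$QUERY" | tr ' ' '+')
echo "$INFLUX/db/$DB/series?u=root&p=root&q=$ENCODED"
# Or let curl do the encoding:
#   curl -G "$INFLUX/db/$DB/series?u=root&p=root" --data-urlencode "q=$QUERY"
```

Run it from cron (e.g. twice a day) so the database never grows past a few hundred MB per cAdvisor.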
3), For InfluxDB availability, you can write a script that periodically checks whether the relevant databases and series exist, and creates them if they do not.
# Check whether the database exists
curl -G 'http://192.168.16.100:8086/db?u=root&p=root&q=list+databases&pretty=true'
curl -G 'http://192.168.16.100:8086/db?u=root&p=root&q=show+databases&pretty=true'
# Check the series in a database [points part]
curl -G 'http://192.168.16.100:8086/db/cadvisor/series?u=root&p=root&q=list+series&pretty=true'
# Create a database, name: cadvisor
curl "http://www.perofu.com:8086/db?u=root&p=root" -d "{\"name\": \"cadvisor\"}"
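The check and the create call above can be combined into one guard script; a sketch assuming the InfluxDB 0.8 HTTP API and the same host and root/root credentials (the grep on the JSON response is an assumption about the response format, so verify it against your server's actual output first):

```shell
# Recreate the "cadvisor" database if it is missing (InfluxDB 0.8 API;
# host/credentials from this document, response format is an assumption).
INFLUX="http://192.168.16.100:8086"
DB="cadvisor"
if ! curl -sG "$INFLUX/db?u=root&p=root&q=show+databases&pretty=true" \
      | grep -q "\"$DB\""; then
  echo "database $DB missing, creating it"
  curl -s "$INFLUX/db?u=root&p=root" -d "{\"name\": \"$DB\"}"
fi
```

Scheduled from cron, this keeps cAdvisor's storage backend available even if the database is dropped.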
③, Grafana notes:
Writing Grafana queries takes some effort; consult the query statements on the official website, or simply borrow other people's templates.
Influxdb query statement:
https://docs.influxdata.com/influxdb/v0.8/api/query_language/
④, Heapster notes:
For larger k8s clusters, Heapster's current caching approach eats up a lot of memory.
Because it must periodically fetch container information for the entire cluster, keeping all of it in memory is a problem, and Heapster also needs to support API access to recent metrics.
If Heapster runs as a pod, OOM kills are likely; it is therefore currently recommended to disable the cache and run Heapster standalone, separated from the k8s platform, with the container running separately on each node.
The biggest advantage of heapster is that the monitoring data it captures can be grouped by pod, container, namespace, etc.
In this way, the monitoring information can be kept private, that is, each k8s user can only see the resource usage of his own application.
Heapster covers more of the cluster than a single cAdvisor, yet its data set is complete and it stores less data in InfluxDB; although both are Google projects, they serve different functions.
When the Heapster container is started on its own, it will connect to InfluxDB and create a k8s database.
Heapster collects two kinds of metrics [keep this in mind when building Grafana queries]:
1), cumulative: the aggregated value is a [ cumulative value ], including CPU usage time and network traffic in/out;
2), gauge: the aggregated value is an [ instantaneous value ], including memory usage.
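When graphing a cumulative metric in Grafana, a per-second rate is usually derived from two consecutive samples; the arithmetic (with made-up sample values) is just:

```shell
# Derive a per-second rate from two samples of a cumulative metric
# (e.g. cpu/usage or network/rx); the numbers here are made up.
prev=1200          # value at t0
now=1500           # value at t0 + 30 s
interval=30        # seconds between samples
rate=$(( (now - prev) / interval ))
echo "rate: $rate per second"    # gauge metrics are plotted directly
```

This is why the table below also lists ready-made `*_rate` metrics: Heapster has already done this delta for you.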
Reference: https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
Metric | Description | Classification
cpu/limit | CPU limit, can be set in the yaml file | Instantaneous value
cpu/node_reservation | CPU reservation of the kube node, similar to cpu/limit | Instantaneous value
cpu/node_utilization | Node CPU utilization | Instantaneous value
cpu/request | CPU requested resources, can be set in the yaml file | Instantaneous value
cpu/usage | CPU usage | Cumulative value
cpu/usage_rate | CPU usage rate | Instantaneous value
filesystem/limit | Filesystem limit | Instantaneous value
filesystem/usage | Filesystem usage | Instantaneous value
memory/limit | Memory limit, can be set in the yaml file | Instantaneous value
memory/major_page_faults | Major page faults | Cumulative value
memory/major_page_faults_rate | Major page fault rate | Instantaneous value
memory/node_reservation | Node memory reservation | Instantaneous value
memory/node_utilization | Node memory utilization | Instantaneous value
memory/page_faults | Page faults | Cumulative value
memory/page_faults_rate | Page fault rate | Instantaneous value
memory/request | Memory request, can be set in the yaml file | Instantaneous value
memory/usage | Memory usage | Instantaneous value
memory/working_set | Working set memory | Instantaneous value
network/rx | Total network bytes received | Cumulative value
network/rx_errors | Network receive errors | Uncertain
network/rx_errors_rate | Network receive error rate | Instantaneous value
network/rx_rate | Network receive rate | Instantaneous value
network/tx | Total network bytes sent | Cumulative value
network/tx_errors | Network send errors | Uncertain
network/tx_errors_rate | Network send error rate | Instantaneous value
network/tx_rate | Network send rate | Instantaneous value
uptime | Time since the container started, in milliseconds | Instantaneous value