Manual installation - Alibaba Cloud solution (deployment fails):
https://www.jianshu.com/p/1c7ddf18e8b2
Manual installation - camilb/prometheus-kubernetes (deploys successfully, but only exposes CPU, memory, and other host metrics; no GPU metrics):
https://github.com/camilb/prometheus-kubernetes/tree/master
Helm installation - gpu-monitoring-tools solution (deploys successfully):
Reference: http://fly-luck.github.io/2018/12/10/gpu-monitoring-tools%20Prometheus/
Corresponding GitHub repo: https://github.com/NVIDIA/gpu-monitoring-tools/tree/helm-charts
- gpu-monitoring-tools (hereinafter GMT) includes several metrics-collection components:
  - NVML Go Bindings (C API)
  - DCGM exporter (exposes DCGM metrics to Prometheus)
- GMT provides two monitoring frameworks:
  - DCGM exporter as a DaemonSet scraped by plain Prometheus: collection and monitoring only.
  - Prometheus Operator + kube-prometheus (modified by NVIDIA): a complete stack with collection, monitoring, alerting, and graphical components.
We use the second framework; on machines without a GPU it still monitors the CPU and other host resources.
In practice, this setup can simultaneously monitor host hardware (CPU, GPU, memory, disk, etc.), Kubernetes core components (apiserver, controller-manager, scheduler, etc.), and the business services running on Kubernetes.
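To make the DCGM exporter bullet concrete, here is a sketch of the Prometheus text format it emits. The sample below is hypothetical (the metric name follows the exporter's DCGM_FI_* naming convention, but the GPU ids and values are invented), filtered down to its samples with awk:

```shell
# Hypothetical sample of DCGM exporter output; a real scrape would come from
# the exporter's metrics endpoint. This file is only for illustration.
cat > /tmp/dcgm_sample.prom <<'EOF'
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 83
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 0
EOF

# Drop the comment lines and print metric{labels} value pairs.
awk '!/^#/ {print $1, $2}' /tmp/dcgm_sample.prom
```

Prometheus scrapes exactly this kind of plain-text endpoint from each DaemonSet pod, which is why the exporter-only framework needs no extra components beyond Prometheus itself.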
What is an Operator
- For stateless applications, native Kubernetes resources (e.g. Deployment) already provide good support for automatic scaling, restarts, and upgrades.
- For stateful applications such as databases, caches, and monitoring systems, each application needs its own operation and maintenance procedures.
- An Operator packages those application-specific operational procedures into software and extends the Kubernetes API with third-party resources (typically a set of CRDs) through which users create, configure, and manage the application.
- Analogous to the Resource/Controller pairing in Kubernetes itself, an Operator's controller reconciles the actual number and state of instances toward what the user requested, while encapsulating most of the operational details.
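As an illustration of the pattern, here is a minimal sketch of the kind of third-party resource an Operator introduces. prometheus-operator, used below, defines a `Prometheus` CRD roughly along these lines; the field values here are placeholders, not taken from the kube-prometheus chart:

```shell
# Hypothetical Prometheus custom resource: the user declares desired state,
# and the Operator's controller reconciles pods toward it.
cat > /tmp/prometheus-cr.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2               # the controller keeps this many instances running
  serviceMonitorSelector: {}
EOF

grep -c '^kind: Prometheus' /tmp/prometheus-cr.yaml
```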
Preparation
Images
Import the following images onto every node in the cluster:

```
# If you build Prometheus on vanilla Kubernetes, these two images are enough to create the resources, but you only get metrics, with no alerting integration
```

Helm chart
Download and unpack the following helm chart:

```
wget https://nvidia.github.io/gpu-monitoring-tools/helm-charts/kube-prometheus-0.0.43.tgz
```
Installation steps
1. Configuration
Node labels
Label the GPU nodes that need monitoring:

```
kubectl label no <nodename> hardware-type=NVIDIAGPU
```
External etcd
With an external etcd, i.e. etcd is started outside the cluster before Kubernetes initialization rather than as containers brought up with the cluster, the etcd cluster addresses must be specified explicitly.
Assume the external etcd member IPs are etcd0, etcd1, and etcd2, the client port is 2379, and access is plain HTTP.

```
vim kube-prometheus/charts/exporter-kube-etcd/values.yaml
```

```
#etcdPort: 4001
```
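The endpoint list for the three external etcd members can be assembled mechanically. A sketch with placeholder IPs standing in for etcd0/etcd1/etcd2:

```shell
# Placeholder addresses for etcd0, etcd1, etcd2; plain HTTP on client port
# 2379, as stated above.
ETCD_IPS="10.0.0.1 10.0.0.2 10.0.0.3"

endpoints=""
for ip in $ETCD_IPS; do
  # Append each member as http://<ip>:2379, comma-separated.
  endpoints="${endpoints:+$endpoints,}http://$ip:2379"
done

echo "$endpoints"   # http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379
```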
At the same time, the grafana dashboard data needs to be patched. Note:
- Add a "," after line 465, the closing line of the "Crashlooping Control Plane Pods" panel.
- Insert the following content after line 465, keeping the surrounding indentation.

```
vim kube-prometheus/charts/grafana/dashboards/kubernetes-cluster-status-dashboard.json
```

```
{
```
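The manual JSON edit is easy to get wrong (a missing comma breaks the dashboard). An equivalent, toy-scale sketch that appends a panel programmatically and re-validates the JSON; the dashboard content and the panel title below are invented, not the chart's real 465-line file:

```shell
# Toy stand-in for kubernetes-cluster-status-dashboard.json.
cat > /tmp/dash.json <<'EOF'
{"rows": [{"panels": [{"title": "Crashlooping Control Plane Pods"}]}]}
EOF

# Append a hypothetical panel after the existing one and pretty-print;
# json.load failing here would catch a malformed edit.
python3 - <<'EOF'
import json
with open('/tmp/dash.json') as f:
    d = json.load(f)
d['rows'][0]['panels'].append({'title': 'GPU Utilization'})
with open('/tmp/dash.json', 'w') as f:
    json.dump(d, f, indent=2)
EOF

grep -c '"title"' /tmp/dash.json
```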
Exposing ports
Expose the access ports of prometheus, alertmanager, and grafana for troubleshooting. These ports must be directly reachable from the development VPC.

```
vim kube-prometheus/values.yaml
```

```
alertmanager:
```

```
vim kube-prometheus/charts/grafana/values.yaml
```

```
service:
```
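The two values.yaml fragments above are truncated. As a hedged sketch of the kind of change involved, the services can be switched to NodePort; the port numbers and exact key layout below are placeholders, not the chart's defaults:

```shell
# Hypothetical values fragment; verify the key layout against the actual chart.
cat > /tmp/expose-ports.yaml <<'EOF'
alertmanager:
  service:
    type: NodePort
    nodePort: 30779
prometheus:
  service:
    type: NodePort
    nodePort: 30778
grafana:
  service:
    type: NodePort
    nodePort: 30780
EOF

grep -c 'type: NodePort' /tmp/expose-ports.yaml
```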
Alert receivers
Configure the alert receivers. We usually point them at the ControlCenter Service in the same cluster, which reformats the alert messages and forwards them to IMS.

```
vim kube-prometheus/values.yaml
```

```
alertmanager:
```
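A sketch of what such a receiver could look like in Alertmanager configuration terms; the webhook URL for the ControlCenter Service is a placeholder:

```shell
# Hypothetical Alertmanager config fragment: route everything to a webhook
# served by an in-cluster ControlCenter Service, which reformats for IMS.
cat > /tmp/alertmanager-receiver.yaml <<'EOF'
route:
  receiver: controlcenter
receivers:
  - name: controlcenter
    webhook_configs:
      - url: http://controlcenter.default.svc:8080/alerts
EOF

grep -c 'webhook_configs' /tmp/alertmanager-receiver.yaml
```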
Alert rules
Platform monitoring covers node hardware (CPU, memory, disk, network, GPU), K8s components (Kube-Controller-Manager, Kube-Scheduler, Kubelet, API Server), K8s applications (Deployment, StatefulSet, Pod), and so on.
Since the rule set is long, the monitoring and alerting rules are given in the appendix.
2. Startup

```
cd prometheus-operator
```

```
cd kube-prometheus
```

3. Cleanup

```
helm delete --purge kube-prometheus
```

```
helm delete --purge prometheus-operator
```
Common problems
Kubelet metrics cannot be exposed
This problem does not occur on Kubernetes v1.13.0.
- For versions before 1.13.0, change the way kubelet metrics are fetched from https to http; otherwise Prometheus's kubelet targets will be down. [github issue 926]

```
vim kube-prometheus/charts/exporter-kubelets/templates/servicemonitor.yaml
```

```
spec:
```

- Verification
The kubelet targets appear on the Prometheus page.
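The servicemonitor.yaml fragment above is truncated. A hedged sketch of the https-to-http change it describes; the exact field layout in the chart may differ:

```shell
# Hypothetical ServiceMonitor endpoints fragment: scrape kubelet over plain
# http instead of https for clusters before v1.13.0.
cat > /tmp/kubelet-servicemonitor-patch.yaml <<'EOF'
spec:
  endpoints:
    - port: http-metrics
      scheme: http
EOF

grep -c 'scheme: http' /tmp/kubelet-servicemonitor-patch.yaml
```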
controller-manager and scheduler metrics cannot be exposed
Method one
For Kubernetes v1.13.0.
- Add the following to kubeadm.conf, and pass it at initialization time with kubeadm init --config kubeadm.conf.

```
apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: 1.13.0
networking:
  podSubnet: 10.244.0.0/16
controllerManagerExtraArgs:
  address: 0.0.0.0
schedulerExtraArgs:
  address: 0.0.0.0
...
```

- Label the pods.

```
kubectl get po -n kube-system
kubectl -n kube-system label po kube-controller-manager-<nodename> k8s-app=kube-controller-manager
kubectl -n kube-system label po kube-scheduler-<nodename> k8s-app=kube-scheduler
kubectl get po -n kube-system --show-labels
```

- Verification
The kube-controller-manager and kube-scheduler targets appear on the Prometheus page.
The controller-manager and scheduler status panels appear on the grafana page.
Method two
For Kubernetes before 1.13.0.
- Modify the core kubeadm configuration.

```
kubeadm config view
```

Save the above output as newConfig.yaml and add the following two lines:

```
controllerManagerExtraArgs:
```

Apply the new configuration:

```
kubeadm config upload from-file --config newConfig.yaml
```

- Label the pods.

```
kubectl get po -n kube-system
kubectl -n kube-system label po kube-controller-manager-<nodename> k8s-app=kube-controller-manager
kubectl -n kube-system label po kube-scheduler-<nodename> k8s-app=kube-scheduler
kubectl get po -n kube-system --show-labels
```
- Rebuild the exporters.

```
kubectl -n kube-system get svc
```

Two Services without a CLUSTER-IP can be seen:

```
kube-prometheus-exporter-kube-controller-manager
```

```
kubectl -n kube-system get svc kube-prometheus-exporter-kube-controller-manager -o yaml
```

Save the outputs above as newKubeControllerManagerSvc.yaml and newKubeSchedulerSvc.yaml respectively, remove the non-essential metadata (uid, selfLink, resourceVersion, creationTimestamp, etc.), and recreate the Services.
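The field-stripping step can be sketched mechanically; the Service metadata below is invented, and only the field names to remove come from the text:

```shell
# Invented stand-in for the `kubectl get svc ... -o yaml` output.
cat > /tmp/svc-dump.yaml <<'EOF'
metadata:
  name: kube-prometheus-exporter-kube-controller-manager
  namespace: kube-system
  uid: 1234-abcd
  selfLink: /api/v1/namespaces/kube-system/services/foo
  resourceVersion: "100"
  creationTimestamp: "2018-12-10T00:00:00Z"
EOF

# Drop the non-essential fields before recreating the Service.
grep -vE '^[[:space:]]*(uid|selfLink|resourceVersion|creationTimestamp):' \
  /tmp/svc-dump.yaml > /tmp/newKubeControllerManagerSvc.yaml
cat /tmp/newKubeControllerManagerSvc.yaml
```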
```
kubectl delete -n kube-system svc kube-prometheus-exporter-kube-controller-manager kube-prometheus-exporter-kube-scheduler
```

- Make sure the Prometheus pods can reach kube-controller-manager and kube-scheduler on NodePorts 10251/10252.
- Verification is the same as method one.
coredns cannot be exposed
In Kubernetes v1.13.0 the default cluster DNS component is coredns, so the kube-prometheus configuration must be modified before the DNS service can be monitored.
Method one
- Change the selectorLabel value in the configuration to match the coredns pods' label.

```
kubectl -n kube-system get po --show-labels | grep coredns
# output
coredns k8s-app=kube-dns
```

```
vim kube-prometheus/charts/exporter-coredns/values.yaml
```

```
#selectorLabel: coredns
```
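Since the kubectl output above shows the coredns pods carry k8s-app=kube-dns, the selectorLabel value needs to become kube-dns. A sketch of the resulting values fragment (surrounding keys omitted):

```shell
# Match the coredns pods' k8s-app label value shown by kubectl above.
cat > /tmp/coredns-values-patch.yaml <<'EOF'
selectorLabel: kube-dns
EOF

grep -c 'selectorLabel: kube-dns' /tmp/coredns-values-patch.yaml
```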
- Restart kube-prometheus.

```
helm delete --purge kube-prometheus
helm install --name kube-prometheus --namespace monitoring kube-prometheus
```

- Verification
The kube-dns target appears on the Prometheus page.
Method two
- Change the pod labels to match the configuration.

```
kubectl -n kube-system label po
```

- Verification is the same as method one.
After a successful deployment, the grafana dashboards must be reached via port-forward to view the monitoring:
https://blog.csdn.net/aixiaoyang168/article/details/81661459