Because of the ephemeral nature of Docker containers, traditional Zabbix cannot effectively monitor container status inside a k8s cluster, so we use Prometheus for monitoring instead.
What is Prometheus?
Prometheus is an open-source monitoring and alerting system with a built-in time-series database (TSDB), originally developed at SoundCloud. It is written in Go and is an open-source version of Google's BorgMon monitoring system.
In 2016, Prometheus joined the Cloud Native Computing Foundation (CNCF), founded by Google under the Linux Foundation, becoming its second hosted project after Kubernetes.
Prometheus currently has a very active open-source community.
Compared with Heapster (a k8s subproject for collecting cluster performance data), Prometheus is more capable and comprehensive, and performant enough to support clusters of tens of thousands of nodes.
Prometheus features
- A multi-dimensional data model.
- A flexible query language (PromQL).
- No reliance on distributed storage; single server nodes are autonomous.
- Time-series data is collected via an HTTP-based pull model.
- Pushing time-series data is supported through an intermediary gateway.
- Targets are found via static configuration or service discovery.
- Multiple graphing and dashboarding integrations, such as Grafana.
Basic principle
Prometheus' basic principle is to periodically scrape the state of monitored components over HTTP. Any component can be monitored as long as it exposes a suitable HTTP interface; no SDK or other integration work is needed. This makes it a good fit for virtualized environments such as VMs, Docker, and Kubernetes. The program that exposes a monitored component's information over HTTP is called an exporter. Exporters already exist for most components in common use at internet companies and can be used directly, e.g. Varnish, HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, etc.).
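For instance (a minimal sketch, assuming a node-exporter is already listening on port 9100 of the local host), a scrape is nothing more than an HTTP GET against the exporter's /metrics endpoint; every returned line is a sample in Prometheus' text exposition format, e.g. metric_name{label="value"} 123:
# curl -s http://localhost:9100/metrics | head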
Service workflow
- The Prometheus daemon periodically scrapes metrics from its targets; each target must expose an HTTP endpoint for Prometheus to scrape on schedule. Targets can be specified via configuration files, text files, Zookeeper, Consul, DNS SRV lookup, etc. Prometheus monitors with a PULL model: it pulls data directly from target servers, or indirectly via an intermediary gateway that targets push to.
- Prometheus stores all scraped data locally, cleans and aggregates it according to configured rules, and saves the results as new time series.
- The collected data can be visualized via PromQL and other APIs. Prometheus supports many charting options, e.g. Grafana, the bundled Promdash, and its own template engine. It also provides an HTTP query API for custom output.
- PushGateway: clients can actively push metrics to the PushGateway, and Prometheus then scrapes the data from the gateway on its regular schedule (see the sketch after this list).
- Alertmanager is a component independent of Prometheus; driven by Prometheus queries, it provides very flexible alerting.
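As a quick sketch of the push and query paths mentioned above (the PushGateway address is hypothetical; the query assumes a Prometheus server on localhost:9090):
# echo "some_metric 42" | curl --data-binary @- http://pushgateway.example:9091/metrics/job/some_job
# curl 'http://localhost:9090/api/v1/query?query=up'
The first command pushes a single sample for the temporary job some_job; the second uses the HTTP API to evaluate the PromQL expression up, which shows which targets are currently reachable.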
The three core components
- Server: collects and stores data, and serves queries in the PromQL query language.
- Alertmanager: the alert manager, responsible for alerting.
- Push Gateway: an intermediary gateway that accepts metrics pushed proactively by temporary jobs.
Unlike Zabbix, Prometheus has no agent; different exporters are used for different services:
Prometheus official website: https://prometheus.io/
Typically, four exporters are used to monitor a k8s cluster and its nodes and pods:
- kube-state-metrics — collects basic state information about the k8s cluster (master, etcd, etc.)
- node-exporter — collects information about the k8s cluster nodes
- cadvisor — collects resource-usage information for the Docker containers inside the k8s cluster
- blackbox-exporter — probes whether Docker container services in the k8s cluster are alive
Now let's create each of the exporters above, one by one.
The usual routine: pull the Docker image, prepare the resource manifests, then apply them:
I. kube-state-metrics
# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
# docker tag 91599517197a harbor.od.com/public/kube-state-metrics:v1.5.0
# docker push harbor.od.com/public/kube-state-metrics:v1.5.0
Prepare the resource manifests:
1. rbac.yaml
# mkdir /data/k8s-yaml/kube-state-metrics && cd /data/k8s-yaml/kube-state-metrics
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
2. dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  labels:
    grafanak8sapp: "true"
    app: kube-state-metrics
  name: kube-state-metrics
  namespace: kube-system
spec:
  selector:
    matchLabels:
      grafanak8sapp: "true"
      app: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        grafanak8sapp: "true"
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: harbor.od.com/public/kube-state-metrics:v1.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
      serviceAccountName: kube-state-metrics
Apply the resource manifests:
# kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/rbac.yaml
# kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/dp.yaml
Test it:
# kubectl get pod -n kube-system -o wide
# curl http://172.7.22.10:8080/healthz
It is running successfully.
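Besides /healthz, you can also spot-check the metrics endpoint itself (the pod IP comes from the kubectl get pod -o wide output above; yours will likely differ):
# curl -s http://172.7.22.10:8080/metrics | head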
II. node-exporter
Since node-exporter monitors the nodes themselves, one instance needs to run on every node, so we use a DaemonSet (ds) controller.
# docker pull prom/node-exporter:v0.15.0
# docker tag 12d51ffa2b22 harbor.od.com/public/node-exporter:v0.15.0
# docker push harbor.od.com/public/node-exporter:v0.15.0
Prepare the resource manifests:
1. ds.yaml
# mkdir node-exporter && cd node-exporter
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    daemon: "node-exporter"
    grafanak8sapp: "true"
spec:
  selector:
    matchLabels:
      daemon: "node-exporter"
      grafanak8sapp: "true"
  template:
    metadata:
      name: node-exporter
      labels:
        daemon: "node-exporter"
        grafanak8sapp: "true"
    spec:
      volumes:
      - name: proc
        hostPath:
          path: /proc
          type: ""
      - name: sys
        hostPath:
          path: /sys
          type: ""
      containers:
      - name: node-exporter
        image: harbor.od.com/public/node-exporter:v0.15.0
        imagePullPolicy: IfNotPresent
        args:
        - --path.procfs=/host_proc
        - --path.sysfs=/host_sys
        ports:
        - name: node-exporter
          hostPort: 9100
          containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: sys
          readOnly: true
          mountPath: /host_sys
        - name: proc
          readOnly: true
          mountPath: /host_proc
      hostNetwork: true
Apply the resource manifest:
# kubectl apply -f http://k8s-yaml.od.com/node-exporter/ds.yaml
# kubectl get pod -n kube-system -o wide
We have two nodes, and one pod is started on each:
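Because the DaemonSet runs with hostNetwork and hostPort 9100, each node now serves metrics on its own address; a quick check (the node IP is an example — substitute your own):
# curl -s http://10.4.7.21:9100/metrics | grep -c '^node_'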
III. cadvisor
# docker pull google/cadvisor:v0.28.3
# docker tag 75f88e3ec333 harbor.od.com/public/cadvisor:v0.28.3
# docker push harbor.od.com/public/cadvisor:v0.28.3
Prepare the resource manifests:
# mkdir cadvisor && cd cadvisor
1. ds.yaml — the highlighted part (the tolerations block) is an important advanced attribute of k8s resource manifests; the next post will cover it in detail.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: kube-system
  labels:
    app: cadvisor
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      hostNetwork: true
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: cadvisor
        image: harbor.od.com/public/cadvisor:v0.28.3
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        ports:
        - name: http
          containerPort: 4194
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 4194
          initialDelaySeconds: 5
          periodSeconds: 10
        args:
        - --housekeeping_interval=10s
        - --port=4194
      terminationGracePeriodSeconds: 30
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /data/docker
Make some adjustments to the mounted resources on each node:
# mount -o remount,rw /sys/fs/cgroup/
# ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
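The remount makes the cgroup hierarchy writable, and the symlink provides the cpuacct,cpu name (reversed order) that this cadvisor version expects; you can confirm the link with:
# ls -l /sys/fs/cgroup/ | grep cpuacct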
Apply the resource manifest:
# kubectl apply -f http://k8s-yaml.od.com/cadvisor/ds.yaml
Check:
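For example, list the pods and scrape the cadvisor port directly on one node (the node IP is an example):
# kubectl get pod -n kube-system -o wide | grep cadvisor
# curl -s http://10.4.7.21:4194/metrics | head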
IV. blackbox-exporter
# docker pull prom/blackbox-exporter:v0.15.1
# docker tag 81b70b6158be harbor.od.com/public/blackbox-exporter:v0.15.1
# docker push harbor.od.com/public/blackbox-exporter:v0.15.1
Create the resource manifests:
1. cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 2s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: [200,301,302]
          method: GET
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 2s
2. dp.yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: blackbox-exporter
  namespace: kube-system
  labels:
    app: blackbox-exporter
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
          defaultMode: 420
      containers:
      - name: blackbox-exporter
        image: harbor.od.com/public/blackbox-exporter:v0.15.1
        imagePullPolicy: IfNotPresent
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yml
        - --log.level=info
        - --web.listen-address=:9115
        ports:
        - name: blackbox-port
          containerPort: 9115
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 50Mi
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
3. svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  ports:
  - name: blackbox-port
    protocol: TCP
    port: 9115
4. ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  rules:
  - host: blackbox.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: blackbox-exporter
          servicePort: blackbox-port
A domain name is used here, so add a DNS record:
# vi /var/named/od.com.zone
blackbox A 10.4.7.10
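Reload the zone (remember to bump the serial number in the zone file too) and verify that the record resolves:
# systemctl restart named
# dig -t A blackbox.od.com +short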
Apply the resource manifests:
# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/cm.yaml
# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/dp.yaml
# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/svc.yaml
# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/ingress.yaml
Test by visiting the domain:
If the following page appears, blackbox-exporter is running successfully.
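You can also trigger a probe by hand through the /probe endpoint (the module names come from cm.yaml above; the target is just an example):
# curl 'http://blackbox.od.com/probe?module=http_2xx&target=www.baidu.com'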
Next, deploy the Prometheus server:
# docker pull prom/prometheus:v2.14.0
# docker tag 7317640d555e harbor.od.com/infra/prometheus:v2.14.0
# docker push harbor.od.com/infra/prometheus:v2.14.0
Prepare the resource manifests:
1. rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: infra
2. dp.yaml
Add --web.enable-lifecycle to enable hot-reloading of the configuration file over HTTP; the reload is triggered with curl -X POST http://localhost:9090/-/reload
--storage.tsdb.min-block-duration=10m  # only load 10 minutes of data into memory
--storage.tsdb.retention=72h           # retain 72 hours of data
apiVersion: extensions/v1beta1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "5" labels: name: prometheus name: prometheus namespace: infra spec: progressDeadlineSeconds: 600 replicas: 1 revisionHistoryLimit: 7 selector: matchLabels: app: prometheus strategy: rollingUpdate: maxSurge: 1 maxUnavailable: 1 type: RollingUpdate template: metadata: labels: app: prometheus spec: containers: - name: prometheus image: harbor.od.com/infra/prometheus:v2.14.0 imagePullPolicy: IfNotPresent command: - /bin/prometheus args: - --config.file=/data/etc/prometheus.yml - --storage.tsdb.path=/data/prom-db - --storage.tsdb.min-block-duration=10m - --storage.tsdb.retention=72h - --web.enable-lifecycle ports: - containerPort: 9090 protocol: TCP volumeMounts: - mountPath: /data name: data resources: requests: cpu: "1000m" memory: "1.5Gi" limits: cpu: "2000m" memory: "3Gi" imagePullSecrets: - name: harbor securityContext: runAsUser: 0 serviceAccountName: prometheus volumes: - name: data nfs: server: hdss7-200 path: /data/nfs-volume/prometheus
3. svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: infra
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
4. ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  name: prometheus
  namespace: infra
spec:
  rules:
  - host: prometheus.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090
Another domain name is used here; add a DNS record:
prometheus A 10.4.7.10
Remember to restart the named service.
Create the required directories:
# mkdir -p /data/nfs-volume/prometheus/{etc,prom-db}
Edit the Prometheus configuration file (don't ask why it's written this way; if you ask, the answer is "no idea"~):
# vi /data/nfs-volume/prometheus/etc/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
  scheme: https
  static_configs:
  - targets:
    - '10.4.7.12:2379'
    - '10.4.7.21:2379'
    - '10.4.7.22:2379'
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
    regex: .*true.*
    action: keep
  - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
    regex: 'node-exporter;(.*)'
    action: replace
    target_label: nodename
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: http
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+);(.+)
    replacement: $1:$2$3
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [tcp_connect]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: tcp
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'traefik'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: keep
    regex: traefik
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
Copy the certificates referenced in the configuration file:
# cd /data/nfs-volume/prometheus/etc/
# cp /opt/certs/ca.pem ./
# cp /opt/certs/client.pem ./
# cp /opt/certs/client-key.pem ./
Apply the resource manifests:
# kubectl apply -f http://k8s-yaml.od.com/prometheus-server/rbac.yaml
# kubectl apply -f http://k8s-yaml.od.com/prometheus-server/dp.yaml
# kubectl apply -f http://k8s-yaml.od.com/prometheus-server/svc.yaml
# kubectl apply -f http://k8s-yaml.od.com/prometheus-server/ingress.yaml
Verify in a browser: prometheus.od.com
Click Status → Targets; the entries shown there are the job_names we configured in prometheus.yml, and these targets basically cover our data-collection needs.
Click Status → Configuration to view our configuration file.
In the configuration file, every job except etcd (which uses static configuration) relies on automatic service discovery.
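You can confirm discovery from the command line as well; the HTTP API lists every discovered target (assuming the ingress domain configured above):
# curl -s http://prometheus.od.com/api/v1/targets | head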
Static configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
  scheme: https
  static_configs:
  - targets:
    - '10.4.7.12:2379'
    - '10.4.7.21:2379'
    - '10.4.7.22:2379'
Service discovery (the discovered resource type here is pod):
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
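With pod-role discovery like this, a workload opts in through annotations on its pod template. A minimal sketch (the annotation values are illustrative — the port must match the container's actual port):
annotations:
  # picked up by the 'kubernetes-pods' job
  prometheus_io_scrape: "true"
  prometheus_io_port: "8080"
  prometheus_io_path: "/metrics"
  # picked up by the 'blackbox_tcp_pod_probe' job
  blackbox_scheme: "tcp"
  blackbox_port: "20880"
Prometheus turns each pod annotation into a meta label such as __meta_kubernetes_pod_annotation_prometheus_io_scrape (non-alphanumeric characters become underscores), which is exactly what the relabel_configs above match on.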