I have to say: never change versions casually. I was on 1.13, and at this step I got stuck for a long time because the YAML files were different; only after a lot of googling did I find a fix.
https://www.linuxea.com/2112.html
Previously, resource metrics were collected and viewed through Heapster, but Heapster is being deprecated.
Starting with Kubernetes v1.8, a new feature exposes resource metrics through the API.
Resource metrics: metrics-server
Custom metrics: Prometheus, k8s-prometheus-adapter
This gives the new-generation architecture:
1) Core metrics pipeline: composed of the kubelet, metrics-server, and the API exposed by the API server; it covers cumulative CPU usage, real-time memory usage, pod resource usage, and container disk usage.
2) Monitoring pipeline: collects assorted metrics from the system and serves them to end users, storage systems, and the HPA. It carries the core metrics plus many non-core metrics; non-core metrics cannot be interpreted by Kubernetes itself.
metrics-server is itself an API server, and it only collects CPU usage, memory usage, and the like.
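The core metrics pipeline described above is exactly what the HPA consumes. As a hedged illustration (the target Deployment name `myapp` is hypothetical, not from this post), a minimal `autoscaling/v1` HorizontalPodAutoscaler driven by the core CPU metric looks like this:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp          # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80   # scale when average CPU exceeds 80% of requests
```

Without a working metrics-server, the HPA has no CPU data to act on, which is why this component matters beyond `kubectl top`.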
[root@master ~]# kubectl api-versions
admissionregistration.k8s.io/v1beta1
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1
apiregistration.k8s.io/v1beta1
apps/v1
apps/v1beta1
apps/v1beta2
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
Get the YAML files from https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/metrics-server. Note that the files there have been updated and differ from the ones in the video.
Below are my modified YAML files, kept for reference.
[root@master metrics-server]# cat auth-delegator.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-server:system:auth-delegator
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
[root@master metrics-server]# cat auth-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metrics-server-auth-reader
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
[root@master metrics-server]# cat metrics-apiservice.yaml
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
The key file is this one:
[root@master metrics-server]# cat metrics-server-deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server-v0.3.1
  namespace: kube-system
  labels:
    k8s-app: metrics-server
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.3.1
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
      version: v0.3.1
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
        version: v0.3.1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      containers:
      - name: metrics-server
        image: mirrorgooglecontainers/metrics-server-amd64:v0.3.1
        command:
        - /metrics-server
        - --metric-resolution=30s
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
        # These are needed for GKE, which doesn't support secure communication yet.
        # Remove these lines for non-GKE clusters, and when GKE supports token-based auth.
        #- --kubelet-port=10250
        #- --deprecated-kubelet-completely-insecure=true
        ports:
        - containerPort: 443
          name: https
          protocol: TCP
      - name: metrics-server-nanny
        image: mirrorgooglecontainers/addon-resizer:1.8.4
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 5m
            memory: 50Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: metrics-server-config-volume
          mountPath: /etc/config
        command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --cpu=100m
        - --extra-cpu=0.5m
        - --memory=100Mi
        - --extra-memory=50Mi
        - --threshold=5
        - --deployment=metrics-server-v0.3.1
        - --container=metrics-server
        - --poll-period=300000
        - --estimator=exponential
        # Specifies the smallest cluster (defined in number of nodes)
        # resources will be scaled to.
        - --minClusterSize=10
      volumes:
      - name: metrics-server-config-volume
        configMap:
          name: metrics-server-config
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
[root@master metrics-server]# cat metrics-server-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "Metrics-server"
spec:
  selector:
    k8s-app: metrics-server
  ports:
  - port: 443
    protocol: TCP
    targetPort: https
[root@master metrics-server]# cat resource-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:metrics-server
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  - nodes/stats
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - deployments
  verbs:
  - get
  - list
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:metrics-server
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
If applying the files downloaded from GitHub fails, use my metrics-server-deployment.yaml above instead: delete the deployment and apply it again.
[root@master metrics-server]# kubectl apply -f ./
[root@master ~]# kubectl proxy --port=8080
Make sure metrics-server-v0.3.1-76b796b-4xgvp is in the Running state. I initially saw an Error status caused by problems in the YAML; after changing it back and forth it finally reached Running with the final version shown above.
[root@master metrics-server]# kubectl get pods -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
canal-mgbc2                             3/3     Running   12         3d23h
canal-s4xgb                             3/3     Running   23         3d23h
canal-z98bc                             3/3     Running   15         3d23h
coredns-78d4cf999f-5shdq                1/1     Running   0          6m4s
coredns-78d4cf999f-xj5pj                1/1     Running   0          5m53s
etcd-master                             1/1     Running   13         17d
kube-apiserver-master                   1/1     Running   13         17d
kube-controller-manager-master          1/1     Running   19         17d
kube-flannel-ds-amd64-8xkfn             1/1     Running   0          <invalid>
kube-flannel-ds-amd64-t7jpc             1/1     Running   0          <invalid>
kube-flannel-ds-amd64-vlbjz             1/1     Running   0          <invalid>
kube-proxy-ggcbf                        1/1     Running   11         17d
kube-proxy-jxksd                        1/1     Running   11         17d
kube-proxy-nkkpc                        1/1     Running   12         17d
kube-scheduler-master                   1/1     Running   19         17d
kubernetes-dashboard-76479d66bb-zr4dd   1/1     Running   0          <invalid>
metrics-server-v0.3.1-76b796b-4xgvp     2/2     Running   0          9s
To view the error logs, use -c to specify the container name. This pod has two containers and metrics-server is only one of them; the other is queried the same way, just change the name.
[root@master metrics-server]# kubectl logs metrics-server-v0.3.1-76b796b-4xgvp -c metrics-server -n kube-system
The error log contained roughly the following entries:
403 Forbidden", response: "Forbidden (user=system:anonymous, verb=get, resource=nodes, subresource=stats)

E0903 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:<hostname>: unable to fetch metrics from Kubelet <hostname> (<hostname>): Get https://<hostname>:10250/stats/summary/: dial tcp: lookup <hostname> on 10.96.0.10:53: no such host

no response from https://10.101.248.96:443: Get https://10.101.248.96:443: Proxy Error ( Connection refused )

E1109 09:54:49.509521 1 manager.go:102] unable to fully collect metrics: [
unable to fully scrape metrics from source kubelet_summary:linuxea.node-2.com: unable to fetch metrics from Kubelet linuxea.node-2.com (10.10.240.203): Get https://10.10.240.203:10255/stats/summary/: dial tcp 10.10.240.203:10255: connect: connection refused,
unable to fully scrape metrics from source kubelet_summary:linuxea.node-3.com: unable to fetch metrics from Kubelet linuxea.node-3.com (10.10.240.143): Get https://10.10.240.143:10255/stats/summary/: dial tcp 10.10.240.143:10255: connect: connection refused,
unable to fully scrape metrics from source kubelet_summary:linuxea.node-4.com: unable to fetch metrics from Kubelet linuxea.node-4.com (10.10.240.142): Get https://10.10.240.142:10255/stats/summary/: dial tcp 10.10.240.142:10255: connect: connection refused,
unable to fully scrape metrics from source kubelet_summary:linuxea.master-1.com: unable to fetch metrics from Kubelet linuxea.master-1.com (10.10.240.161): Get https://10.10.240.161:10255/stats/summary/: dial tcp 10.10.240.161:10255: connect: connection refused,
unable to fully scrape metrics from source kubelet_summary:linuxea.node-1.com: unable to fetch metrics from Kubelet linuxea.node-1.com (10.10.240.202): Get https://10.10.240.202:10255/stats/summary/: dial tcp 10.10.240.202:10255: connect: connection refused]
At the time I tried modifying the coredns config following suggestions found online, which only made things worse: the logs then showed "unable" for every single pod, as above. So I reverted the change and deleted the coredns pods, letting the cluster regenerate two fresh coredns containers.
- --kubelet-insecure-tls
This flag disables TLS verification, which is generally not recommended in production. Also, because DNS cannot resolve these hostnames, use --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
to work around that. Another approach is to modify coredns, but I don't recommend it.
See this issue: https://github.com/kubernetes-incubator/metrics-server/issues/131
"metrics-server unable to fetch pod metrics for pod"
Those are the problems I ran into; in any case, the YAML above is guaranteed to resolve all of them. There is also the question of why flannel's DirectRouting setting stops working every time the cluster machines reboot, forcing me to delete flannel and recreate it; I covered that in an earlier post.
At this point the following commands all succeed, and the items field is populated:
[root@master ~]# curl http://localhost:8080/apis/metrics.k8s.io/v1beta1
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "nodes",
      "singularName": "",
      "namespaced": false,
      "kind": "NodeMetrics",
      "verbs": [
        "get",
        "list"
      ]
    },
    {
      "name": "pods",
      "singularName": "",
      "namespaced": true,
      "kind": "PodMetrics",
      "verbs": [
        "get",
        "list"
      ]
    }
  ]
}
[root@master metrics-server]# curl http://localhost:8080/apis/metrics.k8s.io/v1beta1/pods | more
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14868    0 14868    0     0  1521k      0 --:--:-- --:--:-- --:--:-- 1613k
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/metrics.k8s.io/v1beta1/pods"
  },
  "items": [
    {
      "metadata": {
        "name": "pod1",
        "namespace": "prod",
        "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/prod/pods/pod1",
        "creationTimestamp": "2019-01-29T02:39:12Z"
      },
[root@master metrics-server]# kubectl top pods
NAME                CPU(cores)   MEMORY(bytes)
filebeat-ds-4llpp   1m           2Mi
filebeat-ds-dv49l   1m           5Mi
myapp-0             0m           1Mi
myapp-1             0m           2Mi
myapp-2             0m           1Mi
myapp-3             0m           1Mi
myapp-4             0m           2Mi
[root@master metrics-server]# kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master   206m         5%     1377Mi          72%
node1    88m          8%     534Mi           28%
node2    78m          7%     935Mi           49%
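The quantities above use Kubernetes conventions: CPU in millicores ("206m" = 0.206 cores) and memory with binary suffixes ("1377Mi" = 1377 × 2^20 bytes), and the raw metrics API returns the same strings. As a small sketch for working with these values (the helper names are my own, not part of any Kubernetes client library):

```python
def parse_cpu(q: str) -> float:
    """Return CPU in cores: '206m' -> 0.206, '2' -> 2.0."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000.0
    return float(q)

def parse_memory(q: str) -> int:
    """Return memory in bytes for the common binary suffixes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain bytes, no suffix

if __name__ == "__main__":
    print(parse_cpu("206m"))       # master node's CPU, in cores
    print(parse_memory("1377Mi"))  # master node's memory, in bytes
```

This only covers the suffixes seen in the output above; the full Kubernetes quantity grammar also allows decimal suffixes like "M" and "G".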
Custom metrics (Prometheus)
As you can see, our metrics pipeline is now working. However, metrics-server only monitors CPU and memory; for other metrics, such as user-defined ones, it cannot help. That is where another component, Prometheus, comes in.
Deploying Prometheus is quite involved.
node_exporter is the agent;
PromQL is the query language, analogous to SQL, used to query the data;
k8s-prometheus-adapter: Kubernetes cannot consume Prometheus metrics directly, so k8s-prometheus-adapter is needed to convert them into an API that Kubernetes understands.
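To give a flavor of that conversion, here is a sketch of a discovery rule in the k8s-prometheus-adapter config format. The metric name `http_requests_total` is purely illustrative, and exact label names (e.g. `pod` vs `pod_name`) vary by adapter version, so treat this as a shape, not a drop-in config:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # which Prometheus series to expose
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map the namespace label to the k8s namespace
      pod: {resource: "pod"}               # map the pod label to the k8s pod
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"                  # expose a rate under a friendlier name
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Each rule tells the adapter which series to pick up, how to map Prometheus labels onto Kubernetes resources, and which PromQL query to run when the custom metrics API is asked for a value.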
kube-state-metrics is used to aggregate the data.
Now let's start the deployment.