I. Problem Analysis
1. Cause of the Error
After an abnormal server shutdown (unexpected power loss, or the power being pulled), the etcd data was corrupted.
The error in my case, and the operations that led up to it:

One of the flannel network-plugin pods had disappeared outright. After restarting Docker and all the Kubernetes services across the cluster, the control-plane components (etcd, apiserver, controller-manager, scheduler) failed to come up. Restarting those components individually, two of them (etcd and apiserver) still reported errors. The apiserver log showed "Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused"; 2379 is etcd's client port, so the apiserver could not start because it could not reach etcd. The etcd log in turn showed "mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key", an error caused by etcd data corruption after an abnormal server shutdown (unexpected power loss, or the power being pulled).
2. Troubleshooting Steps
```shell
# 1. Inspect the error
~]# kubectl get pods -n kube-system -o wide
NAME          READY   STATUS    RESTARTS   AGE    IP         NODE       NOMINATED NODE   READINESS GATES
flannel-xxx   1/1     Running   6          145d   node2 IP   node2 IP   <none>           <none>
flannel-xxx   1/1     Running   7          145d   node3 IP   node3 IP   <none>           <none>
## Note: this is a three-node cluster, but only two flannel pods are left
# Check the containers on node1 -- the flannel container is gone there as well
~]# docker ps -a | grep flannel
~]# docker ps | grep flannel

# 2. Back up the logs (run on every node)
~]# systemctl status docker.service -l > dockers.err.txt
~]# systemctl status kubelet.service -l > kubelet.err.txt
~]# journalctl -xefu docker > docker.err.txt

# 3. Restart docker and kubelet (run on every node)
~]# systemctl restart docker
~]# systemctl restart kubelet
~]# kubectl get pods -n kube-system -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP              NODE            NOMINATED NODE   READINESS GATES
flannel-xxx                  1/1     Running   6          145d   node1 IP        node1 IP        <none>           <none>
flannel-xxx                  1/1     Running   6          145d   node2 IP        node2 IP        <none>           <none>
flannel-xxx                  1/1     Running   7          145d   node3 IP        node3 IP        <none>           <none>
kube-apiserver-xx            0/1     Running   23         34d    node{1..3} IP   node{1..3} IP   <none>           <none>
kube-controller-manager-xx   0/1     Running   2          145d   node{1..3} IP   node{1..3} IP   <none>           <none>
kube-scheduler-xx            0/1     Running   2          145d   node{1..3} IP   node{1..3} IP   <none>           <none>
etcd-xxx                     0/1     Running   10         145d   node{1..3} IP   node{1..3} IP   <none>           <none>
# Note: the network plugin is back, but etcd, apiserver, controller-manager
# and scheduler are now the four components reporting errors

# 4. Restart the failing components individually
~]# kubectl get pods -n kube-system | grep -v "NAME" | grep "0/1" | awk '{print $1}' | xargs -i kubectl delete pods -n kube-system {}
# Testing again shows two components still failing
~]# kubectl get pods -n kube-system -o wide
NAME                READY   STATUS    RESTARTS   AGE    IP              NODE            NOMINATED NODE   READINESS GATES
kube-apiserver-xx   0/1     Running   23         34d    node{1..3} IP   node{1..3} IP   <none>           <none>
etcd-xxx            0/1     Running   10         145d   node{1..3} IP   node{1..3} IP   <none>           <none>

# 5. Check the logs
~]# docker logs -f <apiserver CONTAINER ID>
xxxx Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused
~]# docker logs -f <etcd CONTAINER ID>
xxxx mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
```
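The filter used in step 4 (drop the header, keep pods whose READY column shows 0/1, print only the name) can be dry-run against canned output before pointing it at a live cluster. A small sketch; the sample pod names are made up for illustration:

```shell
#!/bin/sh
# Dry-run of the step-4 filter pipeline on a canned `kubectl get pods`
# listing: drop the header line, keep pods with READY 0/1, print names.
sample='NAME                         READY   STATUS    RESTARTS   AGE
flannel-abc                  1/1     Running   6          145d
kube-apiserver-node1         0/1     Running   23         34d
etcd-node1                   0/1     Running   10         145d'

failing=$(printf '%s\n' "$sample" | grep -v "NAME" | grep "0/1" | awk '{print $1}')
printf '%s\n' "$failing"
```

On a live cluster the same pipeline feeds xargs to delete the failing pods, exactly as in step 4.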
II. Recovery
1. Back Up the etcd Data
```shell
# 1. If you don't know the path, locate the etcd data directory first
~]# find / -type d -name member
~]# cd /var/lib/etcd/member
~]# mkdir /root/member.back
~]# mv * /root/member.back/
# Note: run on the faulty node. Deleting the data also works, but moving it
# keeps a backup. (The target directory must exist before moving several entries.)
```
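Instead of moving the files bare, a timestamped tarball keeps the corrupt state around for later inspection. A minimal sketch, assuming the data directory is whatever the find command above returned; the helper name and paths are illustrative, not from the procedure above:

```shell
#!/bin/sh
# Sketch: archive an etcd member directory into a timestamped tarball
# before clearing it, so the corrupt state can still be inspected later.
# The helper name and default paths are illustrative assumptions.
backup_member() {
    data_dir=$1                  # e.g. /var/lib/etcd/member
    backup_dir=$2                # e.g. /root
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$backup_dir/etcd-member-$stamp.tar.gz"
    tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")" &&
        printf '%s\n' "$archive"
}

# Usage on the faulty node (BEFORE emptying the directory):
#   backup_member /var/lib/etcd/member /root
```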
2. Recover
```shell
# 1. Restart etcd
~]# docker ps -a | grep etcd
~]# docker rm -f <etcd CONTAINER ID>   # run on every node
~]# systemctl restart kubelet          # restart kubelet so the etcd pod is recreated

# 2. Verify
~]# kubectl get pods
# Note: no more errors now
```
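After the restart it can take a while for the pods to come back, so re-running kubectl get pods by hand gets tedious. A small helper that scans the READY column and succeeds only when every pod has all its containers ready can drive a polling loop; the helper is my own sketch, not a kubectl feature:

```shell
#!/bin/sh
# Sketch: succeed only when every pod in a `kubectl get pods` listing on
# stdin shows all containers ready (READY column like 1/1, 2/2, ...).
# The helper name and logic are illustrative assumptions, not part of kubectl.
all_ready() {
    awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) bad++ } END { exit bad > 0 }'
}

# Usage: poll until the control plane is healthy again
#   until kubectl get pods -n kube-system | all_ready; do sleep 5; done
```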