Sudden errors in a k8s cluster

Contents

I. Problem analysis

1. Cause of the errors

2. Troubleshooting steps

II. Recovery

1. Back up the etcd data

2. Recover


I. Problem analysis

1. Cause of the errors

       etcd data was corrupted after an abnormal server shutdown (unexpected power loss or forced power-off).

My errors and the operations that preceded them:

      One of the flannel network-plugin pods simply disappeared. After restarting Docker and the k8s services on every node, the control-plane components (etcd, apiserver, controller-manager, scheduler) would not start. Restarting the components individually left two still failing (etcd and apiserver). The apiserver log showed "Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused"; 2379 is etcd's client port, so the apiserver could not start because it could not reach etcd. The etcd log in turn showed "mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key", which indicates etcd data corruption caused by an abnormal server shutdown (unexpected power loss or forced power-off).
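
As a quick sanity check, you can confirm on the control-plane node that nothing is actually listening on etcd's client port 2379 (a minimal check; ss ships with iproute2 on most distributions):

~]# ss -lntp | grep 2379
# No output here means nothing is bound to 2379, which matches the apiserver's "connection refused"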

2. Troubleshooting steps

# 1. Check the errors
~]# kubectl get pods -n kube-system -o wide
NAME        READY STATUS  RESTARTS  AGE   IP           NODE         NOMINATED NODE   READINESS GATES
flannel-xxx 1/1   Running 6         145d  <node2 IP>   <node2 IP>   <none>           <none>
flannel-xxx 1/1   Running 7         145d  <node3 IP>   <node3 IP>   <none>           <none>

## Note: this is a three-node cluster, yet only two flannel pods remain
# Check the containers on node1: the flannel container is gone as well
~]# docker ps -a | grep flannel
~]# docker ps | grep flannel
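# Optionally, grep the kubelet journal on node1 for flannel events to see why the pod vanished:
~]# journalctl -u kubelet --no-pager | grep -i flannel | tail -n 20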

# 2. Back up the logs (run on all nodes)
~]# systemctl status docker.service -l > docker.status.txt
~]# systemctl status kubelet.service -l > kubelet.err.txt
~]# journalctl -xu docker > docker.err.txt
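# If you'd rather collect these logs from one machine, a small SSH loop works
# (node1 node2 node3 are placeholder hostnames for this cluster):
~]# for n in node1 node2 node3; do ssh root@$n "journalctl -xu kubelet --no-pager" > kubelet.$n.err.txt; done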

# 3. Restart kubelet and docker (run on all nodes)
~]# systemctl restart docker
~]# systemctl restart kubelet
~]# kubectl get pods -n kube-system -o wide
NAME                       READY STATUS  RESTARTS AGE  IP               NODE             NOMINATED NODE  READINESS GATES
flannel-xxx                1/1   Running 6        145d <node1 IP>       <node1 IP>       <none>          <none>
flannel-xxx                1/1   Running 6        145d <node2 IP>       <node2 IP>       <none>          <none>
flannel-xxx                1/1   Running 7        145d <node3 IP>       <node3 IP>       <none>          <none>
kube-apiserver-xx          0/1   Running 23       34d  <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
kube-controller-manager-xx 0/1   Running 2        145d <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
kube-scheduler-xx          0/1   Running 2        145d <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
etcd-xxx                   0/1   Running 10       145d <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
# Note: the network plugin has recovered, but the four components etcd, apiserver, controller-manager and scheduler now fail

# 4. Restart the failing components individually
~]# kubectl get pods -n kube-system | grep -v "NAME" | grep "0/1" | awk '{print $1}' | xargs -I {} kubectl delete pods -n kube-system {}
# Checking again shows two components still failing
~]# kubectl get pods -n kube-system -o wide
NAME              READY STATUS  RESTARTS AGE  IP               NODE             NOMINATED NODE  READINESS GATES
kube-apiserver-xx 0/1   Running 23       34d  <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
etcd-xxx          0/1   Running 10       145d <node{1..3} IP>  <node{1..3} IP>  <none>          <none>
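# To find the container IDs needed in the next step, filter docker ps by name
# (with the Docker runtime, kubelet prefixes pod containers with k8s_):
~]# docker ps -a --filter name=k8s_kube-apiserver --format '{{.ID}} {{.Names}}'
~]# docker ps -a --filter name=k8s_etcd --format '{{.ID}} {{.Names}}'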

# 5. Check the logs
~]# docker logs -f <apiserver CONTAINER ID>
xxxx
Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused
~]# docker logs -f <etcd CONTAINER ID>
xxxx
mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
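# You can also query etcd's health directly with etcdctl; the certificate paths below
# are kubeadm's defaults and may differ in your cluster:
~]# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint health
# A crashed or corrupted member shows up as "unhealthy" or a refused connection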

II. Recovery

1. Back up the etcd data

# 1. If you don't know the path, locate the etcd data directory first
~]# find / -type d -name member
~]# cd /var/lib/etcd/member
~]# mkdir -p /root/member.back && mv * /root/member.back/
# Note: run this on the faulty node; deleting the data or moving it aside as a backup both work
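# If at least one etcd member is still healthy, it is worth taking a snapshot from it before
# wiping the faulty node's data (a sketch, again assuming kubeadm's default certificate paths;
# <healthy-node-ip> is a placeholder):
~]# ETCDCTL_API=3 etcdctl --endpoints=https://<healthy-node-ip>:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      snapshot save /root/etcd-snapshot.db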

2. Recover

# 1. Restart etcd
~]# docker ps -a | grep etcd
~]# docker rm -f <etcd CONTAINER ID>   # run on all nodes
~]# systemctl restart kubelet          # kubelet recreates the etcd static pod

# 2. Verify
~]# kubectl get pods -n kube-system
# Note: no more errors at this point
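# As a final check that the control plane is healthy again (kubectl get cs is deprecated
# but still available on clusters of this vintage):
~]# kubectl get cs
~]# kubectl get nodes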

Reposted from blog.csdn.net/kali_yao/article/details/126810964