Background information
The Kubernetes cluster, installed from binaries, runs a 3-node etcd cluster. One day a machine hung and could not be reached over SSH, so the administrator rebooted it directly. After the reboot, we deleted a deployment from the cluster, but on repeated checks its pods would sometimes appear and sometimes not. We then ran etcd queries on all 3 nodes and found that the deleted application's data still existed on the rebooted node. From this we concluded that this etcd member held dirty data and was out of sync with the other members.
Troubleshooting process
Problem found
# Delete the application
kubectl -n kube-system delete deploy metrics-server
# Check the application status
kubectl -n kube-system get pod | grep metrics-server
# Repeated queries here showed the pod sometimes existed and sometimes did not
# Check the etcd member status
etcdctl member list
etcdctl --endpoints=https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379 --write-out=table endpoint status
# Run the query on each node to find the problem node
ETCDCTL_API=3 etcdctl get /registry/deployments/kube-system/metrics-server
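To spot the inconsistent member quickly, the same key can be read from each endpoint individually and the outputs compared. A minimal sketch; the certificate paths under /etc/etcd/ssl are assumptions, substitute your own:

```shell
#!/usr/bin/env bash
# Query the same key from each etcd member one at a time.
# NOTE: the cert paths below are assumptions; adjust to your deployment.
CERTS="--cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem"
KEY="/registry/deployments/kube-system/metrics-server"

for ep in https://192.168.100.100:2379 https://192.168.100.101:2379 https://192.168.100.102:2379; do
  echo "== ${ep} =="
  # Pointing --endpoints at a single member shows what that member alone returns
  ETCDCTL_API=3 etcdctl ${CERTS} --endpoints="${ep}" get "${KEY}" --keys-only
done
```

A healthy cluster returns identical output for every endpoint; the member that still returns the deleted key is the one holding dirty data.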
The queries above confirmed that data was inconsistent across the etcd cluster's members. Although the cluster keeps working with the problem node stopped, that is only a stopgap: with 2 members left, losing one more means a leader cannot be elected, which hurts the cluster's robustness and service reliability. We therefore need to repair the etcd service on the problem node.
How to fix
1. Back up data
Before doing anything, back up the healthy data so the cluster can be restored if the repair fails. This is very important, especially in a production environment.
Backup methods:
a. Archive the data directory directly
The directories to archive are mainly the data and wal directories.
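The archive step can be as simple as a timestamped tarball. A sketch, assuming the etcd data directory is /var/lib/etcd (check the --data-dir flag in your unit file):

```shell
#!/usr/bin/env bash
# Archive the etcd data and wal directories before touching anything.
# NOTE: /var/lib/etcd is an assumed data dir; verify --data-dir in your etcd unit file.
BACKUP_DIR="/tmp/etcd-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "${BACKUP_DIR}"
# member/snap holds the data, member/wal holds the write-ahead log
tar -czf "${BACKUP_DIR}/etcd-data.tar.gz" -C /var/lib/etcd member
echo "backup written to ${BACKUP_DIR}"
```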
b. etcd snapshot backup
I have written about this before, so I won't repeat it in detail here.
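For quick reference, taking a snapshot from a healthy member is a one-liner; the cert paths here are assumptions:

```shell
# Take a point-in-time snapshot from a healthy member (cert paths are assumptions)
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem \
  --endpoints=https://192.168.100.100:2379 \
  snapshot save /tmp/etcd-snapshot.db
# Verify the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-snapshot.db --write-out=table
```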
2. Repair steps
1) Stop the etcd service of the problem node
systemctl stop etcd
2) Clear the data directory
Mainly clear the data and wal directories.
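Rather than deleting outright, the stale data can be moved aside so it can be put back if the rejoin fails. A sketch, again assuming /var/lib/etcd as the data dir:

```shell
# Make sure the service is down first
systemctl stop etcd
# Move the stale data aside instead of deleting it (/var/lib/etcd is an assumed path)
mv /var/lib/etcd "/var/lib/etcd.bad-$(date +%Y%m%d)"
# Recreate an empty data dir for the member to rejoin with
mkdir -p /var/lib/etcd
```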
3) Get the member ID of the problem node's etcd
etcdctl member list
4) Remove the problem node from the cluster
etcdctl member remove <problem-node-ID>
5) Re-add the problem node to the cluster
etcdctl [cert flags] --endpoints="https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379" member add etcd-192.168.100.102 --peer-urls="https://192.168.100.102:2380"
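Steps 4) and 5) together look like the following; the cert paths are assumptions, and the endpoints list only the healthy members, since the problem node is down:

```shell
# Cert paths below are assumptions; adjust to your deployment
CERTS="--cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem"
# Address only the healthy members while the problem node is down
ENDPOINTS="https://192.168.100.100:2379,https://192.168.100.101:2379"

# Remove the bad member using the ID shown by `member list`
etcdctl ${CERTS} --endpoints="${ENDPOINTS}" member remove <problem-node-ID>

# Re-add it; etcdctl prints the initial-cluster settings the member must start with,
# which should match the node's config before the service is started again
etcdctl ${CERTS} --endpoints="${ENDPOINTS}" member add etcd-192.168.100.102 \
  --peer-urls="https://192.168.100.102:2380"
```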
6) Modify the etcd configuration file: change the value of initial-cluster-state from new to existing
sed -i 's/initial-cluster-state=new/initial-cluster-state=existing/' /etc/systemd/system/etcd.service
systemctl daemon-reload
7) Start the service
systemctl start etcd
systemctl status etcd
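Once the service is up, the member should rejoin and sync from the leader. Consistency can be confirmed the same way the problem was found; cert paths are again assumptions:

```shell
# Cert paths are assumptions; adjust to your deployment
CERTS="--cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem"
ENDPOINTS="https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379"

# All three members should report healthy
etcdctl ${CERTS} --endpoints="${ENDPOINTS}" endpoint health
# The raft index and revision should converge across members
etcdctl ${CERTS} --endpoints="${ENDPOINTS}" --write-out=table endpoint status
# The deleted deployment's key should now be absent on every member
ETCDCTL_API=3 etcdctl ${CERTS} --endpoints="${ENDPOINTS}" \
  get /registry/deployments/kube-system/metrics-server --keys-only
```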