How to fix data inconsistency on one node of a K8s cluster's etcd

Background information

  The k8s cluster, installed from binaries, has a 3-node etcd cluster. One day a machine hung and could not be reached over SSH, so the administrator power-cycled it. After the restart, a deployment was deleted from the k8s cluster, but repeated checks showed it alternately present and absent. Querying the data with the etcd command on all 3 nodes showed that the deleted application's data still existed on the restarted node, so the judgment was that this etcd member held dirty data and was out of sync with the other nodes.

Troubleshooting process

Problem found

# Delete the application
kubectl -n kube-system delete deploy metrics-server

# Check the application status
kubectl -n kube-system get pod | grep metrics-server

Repeated queries here found the pod sometimes present and sometimes absent.

# Check the etcd member status

etcdctl member list

etcdctl --endpoints=https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379 --write-out=table endpoint status

# Run the query on each node to find the problem node

ETCDCTL_API=3 etcdctl get /registry/deployments/kube-system/metrics-server
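As a sketch (certificate flags omitted for brevity), the same key can be read from each member individually. Serializable reads (--consistency=s) are answered from the local member's data instead of going through the leader, so a diverged member is exposed: healthy members should return nothing for the deleted key, while the problem node still returns it.

# Query each member locally; healthy members should return nothing
for ep in https://192.168.100.100:2379 https://192.168.100.101:2379 https://192.168.100.102:2379; do
  echo "== $ep =="
  ETCDCTL_API=3 etcdctl --endpoints=$ep --consistency=s get /registry/deployments/kube-system/metrics-server --keys-only
done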

  The above confirms that the etcd cluster's members hold inconsistent data. With the problem node stopped the cluster still works, but that is only a stopgap: with just 2 of the 3 members left, one more failure would cost the cluster its quorum and its ability to elect a leader, hurting robustness and service reliability. The etcd service on the problem node therefore needs to be repaired.

How to fix

1. Back up data

Back up the healthy data before doing anything, so that a failed repair can still be rolled back. This is very important, especially in a production environment.

Backup methods:

a. Directly package the data directory

The directories to archive are mainly the data and wal directories.
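A minimal sketch, assuming the data directory is /var/lib/etcd (check the --data-dir flag in your etcd unit file; the wal directory lives under it unless --wal-dir points elsewhere):

# Assumes --data-dir=/var/lib/etcd; adjust to your setup
tar -czvf /tmp/etcd-data-$(date +%F).tar.gz /var/lib/etcd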

b. etcd snapshot backup

This has been covered in an earlier post, so it is not repeated here.
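For reference, a snapshot can be taken from a healthy endpoint with etcdctl (certificate flags omitted here):

ETCDCTL_API=3 etcdctl --endpoints=https://192.168.100.100:2379 snapshot save /tmp/etcd-snapshot-$(date +%F).db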

2. Repair steps

1) Stop the etcd service of the problem node

systemctl stop etcd

2) Clear the data directory

Mainly clear the data and wal directories.
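A sketch under the same assumption that the data directory is /var/lib/etcd; moving the directory aside is safer than deleting it outright:

# Keep the old data around instead of deleting it (assumed path)
mv /var/lib/etcd /var/lib/etcd.bak-$(date +%F)
mkdir -p /var/lib/etcd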

3) Get the member ID of the problem node

etcdctl member list
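Run this against a healthy endpoint. Illustrative output (the member IDs below are made up); the first column is the ID needed in the next step:

1a2b3c4d5e6f7a8b, started, etcd-192.168.100.100, https://192.168.100.100:2380, https://192.168.100.100:2379
2b3c4d5e6f7a8b9c, started, etcd-192.168.100.101, https://192.168.100.101:2380, https://192.168.100.101:2379
3c4d5e6f7a8b9c0d, started, etcd-192.168.100.102, https://192.168.100.102:2380, https://192.168.100.102:2379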

4) Remove the problem node from the cluster

etcdctl member remove <problem-node-ID>

5) Re-add the problem node to the cluster

etcdctl [cert flags] --endpoints="https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379" member add etcd-192.168.100.102 --peer-urls="https://192.168.100.102:2380"
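On success, etcdctl prints configuration hints for the rejoining member, roughly of this shape (the IDs and values below are illustrative):

Member 3c4d5e6f7a8b9c0d added to cluster abcdef0123456789

ETCD_NAME="etcd-192.168.100.102"
ETCD_INITIAL_CLUSTER="etcd-192.168.100.100=https://192.168.100.100:2380,etcd-192.168.100.101=https://192.168.100.101:2380,etcd-192.168.100.102=https://192.168.100.102:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.100.102:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"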

6) Modify the etcd configuration file: change the value of initial-cluster-state from new to existing

# Target the flag explicitly; a blanket s/new/existing/g would also
# rewrite unrelated occurrences of "new" in the unit file
sed -i 's/initial-cluster-state=new/initial-cluster-state=existing/' /etc/systemd/system/etcd.service

systemctl daemon-reload

7) Start the service

systemctl start etcd

systemctl status etcd

8) Check etcd cluster status
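To confirm the member has rejoined and the data has converged, the checks from the diagnosis phase can be repeated (certificate flags omitted):

etcdctl --endpoints=https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379 --write-out=table endpoint status

etcdctl --endpoints=https://192.168.100.100:2379,https://192.168.100.101:2379,https://192.168.100.102:2379 endpoint health

# The deleted key should now be absent on every member
ETCDCTL_API=3 etcdctl get /registry/deployments/kube-system/metrics-server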
