During service deployment and operation, users may perform high-risk operations at different levels, causing service failures of varying severity. To help users estimate and avoid operational risks, this article describes, dimension by dimension (cluster and node, network and load balancing, logs, and EVS disks), which high-risk operations lead to which consequences, and provides solutions to use after a misoperation.
Cluster/node
| Classification | High-risk operation | Consequence | Solution after misoperation |
|---|---|---|---|
| Master node | Modifying the security group of a node in the cluster | May make the master node unavailable. (Note: the master node security group follows the naming rule {cluster name}-cce-control-{random number}.) | Restore the security group rules by referring to the security group of a newly created cluster. |
| Master node | Letting the node expire, or destroying the node | The master node becomes unavailable. | Unrecoverable. |
| Master node | Reinstalling the operating system | The master components are deleted. | Unrecoverable. |
| Master node | Upgrading the master or etcd components yourself | The cluster may become unavailable. | Roll back to the original version. |
| Master node | Deleting or formatting core directory data such as /etc/kubernetes on the node | The master node becomes unavailable. | Unrecoverable. |
| Master node | Changing the node IP address | The master node becomes unavailable. | Change it back to the original IP address. |
| Master node | Modifying the parameters of core components (etcd, kube-apiserver, docker, etc.) yourself | The master node may become unavailable. | Restore the recommended configuration parameters. For details, see Configuration Management. |
| Master node | Replacing the master or etcd certificate yourself | The cluster may become unavailable. | Unrecoverable. |
| Worker node | Modifying the security group of a node in the cluster | May make the node unavailable. (Note: the worker node security group follows the naming rule {cluster name}-cce-node-{random number}.) | Restore the security group rules by referring to the security group of a newly created cluster. |
| Worker node | Deleting the node | The node becomes unavailable. | Unrecoverable. |
| Worker node | Reinstalling the operating system | The node components are deleted and the node becomes unavailable. | Reset the node. For details, see Resetting a Node. |
| Worker node | Upgrading the node kernel | May make the node unavailable or cause network anomalies. (Note: the CCE cluster depends on the system kernel version. Unless it is really necessary, do not use yum update to update or reinstall the operating system kernel of a node; reinstalling with the original image or any other image is likewise a high-risk operation.) | For EulerOS 2.2, see How Do I Solve the Container Network Unavailability Caused by Upgrading the Operating System with yum update? For other operating systems, reset the node. For details, see Resetting a Node. |
| Worker node | Changing the node IP address | The node becomes unavailable. | Change it back to the original IP address. |
| Worker node | Modifying the parameters of core components (kubelet, kube-proxy, etc.) yourself | May make the node unavailable; modifying security-related configurations may leave the components insecure. | Restore the recommended configuration parameters. For details, see Operation Scenarios. |
| Worker node | Modifying the operating system configuration | May make the node unavailable. | Try to restore the configuration items, or reset the node. For details, see Resetting a Node. |
| Worker node | Deleting the /opt or /var/paas directory, or deleting the data disk | The node becomes unavailable. | Reset the node. For details, see Resetting a Node. |
| Worker node | Modifying directory permissions on the node, container directory permissions, and so on | Permissions become abnormal. | Modifying them is not recommended; restore the permissions yourself. |
| Worker node | Formatting or partitioning disks on the node | The node becomes unavailable. | Reset the node. For details, see Resetting a Node. |
| Worker node | Installing other software on the node | The Kubernetes components installed on the node become abnormal, the node status becomes unavailable, and workloads cannot be deployed to the node. | Uninstall the software and try to restore the node, or reset the node. For details, see Resetting a Node. |
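If you suspect that one of the operations above has already been performed, it helps to scope the damage before choosing between recovery and a node reset. A minimal first-response sketch, assuming kubectl access to the cluster and SSH access to the affected worker node (the node name is a placeholder):

```bash
# From any machine with cluster access: is the node still Ready?
kubectl get nodes -o wide
kubectl describe node <node-name>   # check Conditions and recent Events

# On the node itself: is the kubelet still running, and if not, why?
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | tail -n 50
```

The Events section of `kubectl describe node` usually points at the failing component (kubelet, container runtime, or networking) faster than digging through logs.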
Network and load balancing
| High-risk operation | Consequence | Solution after misoperation |
|---|---|---|
| Setting the kernel parameter net.ipv4.ip_forward=0 | Network failures. | Set the kernel parameter back to net.ipv4.ip_forward=1. |
| Setting the kernel parameter net.ipv4.tcp_tw_recycle=1 | NAT becomes abnormal. | Set the kernel parameter back to net.ipv4.tcp_tw_recycle=0. |
| Configuring the node security group to disallow UDP port 53 for the container CIDR | DNS in the cluster cannot work. | Restore the security group rules by referring to the security group of a newly created cluster. |
| Creating a custom listener on an ELB managed by CCE through the ELB console | The modification is reset by CCE. | Create listeners automatically through the Service YAML. |
| Binding a custom backend server to an ELB managed by CCE through the ELB console | The modification is reset by CCE. | Do not manually bind backend servers. |
| Modifying the certificate of an ELB managed by CCE through the ELB console | The modification is reset by CCE. | Manage certificates automatically through the ingress YAML. |
| Modifying the listener name of an ELB managed by CCE through the ELB console | The modification is reset by CCE. | Do not modify the listener name of an ELB managed by CCE. |
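For the two kernel-parameter rows, recovery is a one-line sysctl change, sketched below; persisting it prevents a reboot from reintroducing the problem. Note that net.ipv4.tcp_tw_recycle was removed in Linux kernel 4.12, so it only exists on older kernels:

```bash
# Run as root on the affected node. Check the current values first.
sysctl net.ipv4.ip_forward
sysctl net.ipv4.tcp_tw_recycle    # parameter absent on kernels >= 4.12

# Restore the values that the container network expects.
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.tcp_tw_recycle=0

# Remove or correct the offending lines in /etc/sysctl.conf (or a file
# under /etc/sysctl.d/), then reload so the fix survives reboots.
sysctl -p
```

For the ELB rows, the supported path is to declare the listener in the Service object and let CCE reconcile the load balancer. A minimal sketch of such a Service follows; the annotation shown is CCE's documented way to reference an existing ELB, but treat the exact annotation set as an assumption to verify against your CCE version:

```bash
# Hypothetical Service "my-service"; <existing-elb-id> is a placeholder.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    kubernetes.io/elb.id: <existing-elb-id>  # assumption: CCE ELB annotation
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80          # CCE creates the ELB listener for this port
    targetPort: 8080
EOF
```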
Logs

| High-risk operation | Consequence | Solution after misoperation |
|---|---|---|
| Deleting the /tmp/ccs-log-collector/pos directory on the host | Logs are collected repeatedly. | None. |
| Deleting the /tmp/ccs-log-collector/buffer directory on the host | Logs are lost. | None. |
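Both directories are the log collector's bookkeeping, which is why there is no clean recovery: the pos files remember how far each log file has been read (deleting them causes re-collection), and the buffer holds collected-but-not-yet-shipped chunks (deleting it loses them). Before freeing space here, inspect what is present; a minimal sketch using the paths from the table above:

```bash
# Run on the host. Inspect the collector's state instead of deleting it.
ls -la /tmp/ccs-log-collector/pos      # per-file read-position records
ls -la /tmp/ccs-log-collector/buffer   # log chunks not yet shipped
du -sh /tmp/ccs-log-collector/*        # sizes, if disk pressure is the motive
```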
EVS disks

| High-risk operation | Consequence | Solution after misoperation | Remarks |
|---|---|---|---|
| Manually detaching an EVS disk on the console | Pod writes report an I/O error. | Delete the mount directory on the node and reschedule the Pod. | The file in the Pod records the positions from which files are collected. |
| Running umount on the disk mount path on the node | The Pod writes to the node's local disk. | Remount the corresponding directory into the Pod. | The buffer holds log cache files waiting to be consumed. |
| Operating the EVS disk directly on the node | The Pod writes to the node's local disk. | None. | None. |
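When a Pod starts reporting I/O errors on writes, or quietly writes into the node's local disk, the fastest diagnosis is to check whether the volume is still mounted where the kubelet expects it. A minimal sketch, assuming SSH access to the node; the pod and namespace names are placeholders:

```bash
# On the node: is the disk device still attached, and is the volume mounted?
lsblk                          # does the EVS block device still appear?
mount | grep kubelet           # pod volume mounts live under /var/lib/kubelet

# From the cluster: delete the Pod so its controller recreates it and the
# volume is re-attached and re-mounted cleanly.
kubectl delete pod <pod-name> -n <namespace>
```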