HUAWEI CLOUD-Cloud Container Engine (CCE)-High-risk operations and solutions

During service deployment and operation, users may trigger high-risk operations at different levels, causing service failures of varying severity. To help users estimate and avoid operational risks, this article lists, dimension by dimension (cluster/node, network and load balancing, logs, and EVS disks), which high-risk operations lead to which consequences, and provides a solution for each misoperation.

Cluster/node

Cluster and node high-risk operations

Master nodes

| High-risk operation | Consequence | Solution after misoperation |
| --- | --- | --- |
| Modify the security group of nodes in the cluster | The master nodes may become unavailable | Repair and reopen the security group rules by referring to those of a newly created cluster (naming rule: *cluster name*-cce-control-*random number*) |
| Let the node expire, or destroy it | The master node becomes unavailable | Unrecoverable |
| Reinstall the operating system | The master components are deleted | Unrecoverable |
| Upgrade the master or etcd component version yourself | The cluster may become unavailable | Roll back to the original version |
| Delete or format core directory data such as /etc/kubernetes on the node | The master node becomes unavailable | Unrecoverable |
| Change the node IP address | The master node becomes unavailable | Change it back to the original IP address |
| Modify parameters of core components (etcd, kube-apiserver, docker, etc.) yourself | The master node may become unavailable | Restore the recommended configuration parameters. For details, see Configuration Management |
| Replace the master or etcd certificate yourself | The cluster may become unavailable | Unrecoverable |
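Several of the rows above are marked unrecoverable, so the cheapest mitigation is a backup taken before any manual change touches a core directory such as /etc/kubernetes. A minimal sketch, demonstrated on a scratch copy so it can run without root; on a real node you would set `SRC=/etc/kubernetes` and run it as root (the path comes from the table, the rest is an assumed generic backup routine, not a CCE tool):

```shell
# Back up a Kubernetes config directory before manual changes.
# Demonstrated on a scratch copy; on a real node run as root with
# SRC=/etc/kubernetes.
SRC=${SRC:-$(mktemp -d)/kubernetes}
mkdir -p "$SRC"
echo "demo" > "$SRC/kubelet.conf"    # stand-in for the real config files
BACKUP="$(mktemp -d)/k8s-config-backup-$(date +%Y%m%d%H%M%S).tar.gz"
tar czf "$BACKUP" -C "$(dirname "$SRC")" "$(basename "$SRC")"
echo "backed up $SRC to $BACKUP"
# To roll back a bad edit:
#   tar xzf "$BACKUP" -C "$(dirname "$SRC")"
```

This does not make the "unrecoverable" rows recoverable (a destroyed master is still gone), but it covers the accidental-edit and accidental-delete cases.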

Worker nodes

| High-risk operation | Consequence | Solution after misoperation |
| --- | --- | --- |
| Modify the security group of nodes in the cluster | The nodes may become unavailable | Repair and reopen the security group rules by referring to those of a newly created cluster (naming rule: *cluster name*-cce-node-*random number*) |
| Delete the node | The node becomes unavailable | Unrecoverable |
| Reinstall the operating system | The node components are deleted and the node becomes unavailable | Reset the node. For details, see Resetting a Node |
| Upgrade the node kernel | The node may become unavailable or the network may become abnormal | CCE clusters depend on the system kernel version. Unless necessary, do not use yum update to upgrade the kernel or reinstall the node operating system (reinstalling with the original image or another image is also a high-risk operation). For EulerOS 2.2, see How do I fix container network unavailability caused by a yum update OS upgrade? For other OSs, reset the node. For details, see Resetting a Node |
| Change the node IP address | The node becomes unavailable | Change it back to the original IP address |
| Modify parameters of core components (kubelet, kube-proxy, etc.) yourself | The node may become unavailable; modifying security-related settings may leave components insecure | Restore the recommended configuration parameters. For details, see Operation Scenarios |
| Modify the operating system configuration | The node may become unavailable | Try to restore the configuration items, or reset the node. For details, see Resetting a Node |
| Delete the /opt or /var/paas directory, or delete the data disk | The node becomes unavailable | Reset the node. For details, see Resetting a Node |
| Modify directory permissions on the node, container directory permissions, etc. | Permissions become abnormal | Modification is not recommended; restore the permissions yourself |
| Format or partition the node disks | The node becomes unavailable | Reset the node. For details, see Resetting a Node |
| Install other software on the node yourself | The Kubernetes components installed on the node become abnormal, the node status becomes unavailable, and workloads cannot be scheduled to the node | Uninstall the installed software and try to restore the node, or reset it. For details, see Resetting a Node |
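After any of the worker-node misoperations above, a quick check of the node's core services helps decide between restoring in place and resetting the node. A hedged sketch; the service names below are the common ones, and a given CCE node image may run a different container runtime:

```shell
# Quick triage: report the state of the node services the table refers to.
# Falls back gracefully when systemd is absent (e.g. inside a container).
check_services() {
  for svc in kubelet docker containerd; do
    if command -v systemctl >/dev/null 2>&1; then
      state=$(systemctl is-active "$svc" 2>/dev/null || true)
      state=${state:-unknown}
    else
      state="unknown (no systemd on this host)"
    fi
    echo "$svc: $state"
  done
}
report=$(check_services)
echo "$report"
```

If kubelet or the container runtime cannot be brought back to `active` after restoring the configuration, the table's fallback of resetting the node applies.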

Network and load balancing

| High-risk operation | Consequence | Solution after misoperation |
| --- | --- | --- |
| Set the kernel parameter net.ipv4.ip_forward to 0 | Network problems | Set the kernel parameter back to net.ipv4.ip_forward=1 |
| Set the kernel parameter net.ipv4.tcp_tw_recycle to 1 | NAT becomes abnormal | Set the kernel parameter back to net.ipv4.tcp_tw_recycle=0 |
| Configure the node security group so that UDP port 53 is not open to the container CIDR | DNS in the cluster cannot work | Repair and reopen the security group rules by referring to those of a newly created cluster |
| Create a custom listener on a CCE-managed ELB through the ELB console | The modification is reset by CCE | Create listeners automatically through the Service YAML |
| Bind a custom backend server (RS) to a CCE-managed ELB through the ELB console | - | Do not manually bind backend servers |
| Modify the certificate of a CCE-managed ELB through the ELB console | - | Manage certificates automatically through the ingress YAML |
| Modify the listener name of a CCE-managed ELB through the ELB console | - | Do not modify the listener names of CCE-managed ELBs |
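The two kernel parameters in the table can be inspected and restored with sysctl. A sketch, assuming a Linux node: reading works unprivileged, the restore commands need root, and net.ipv4.tcp_tw_recycle was removed in kernel 4.12, so on newer kernels that row simply does not apply:

```shell
# Report the current values of the kernel parameters the table warns about.
scan_params() {
  for p in net.ipv4.ip_forward net.ipv4.tcp_tw_recycle; do
    f="/proc/sys/$(echo "$p" | tr . /)"
    if [ -r "$f" ]; then
      echo "$p=$(cat "$f")"
    else
      echo "$p: not present on this kernel"
    fi
  done
}
out=$(scan_params)
echo "$out"
# Restore the safe values (as root), then persist them across reboots:
#   sysctl -w net.ipv4.ip_forward=1      # container traffic must be forwarded
#   sysctl -w net.ipv4.tcp_tw_recycle=0  # recycling breaks clients behind NAT
#   echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf && sysctl -p
```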

Logs

| High-risk operation | Consequence | Solution after misoperation | Remarks |
| --- | --- | --- | --- |
| Delete the host directory /tmp/ccs-log-collector/pos | Logs are collected repeatedly | - | The pos files record how far each log file has been collected |
| Delete the host directory /tmp/ccs-log-collector/buffer | Logs are lost | - | The buffer directory holds cached log files waiting to be consumed |

EVS disks

| High-risk operation | Consequence | Solution after misoperation |
| --- | --- | --- |
| Manually detach an EVS disk on the console | Pod writes report an I/O error | Delete the mount directory on the node and reschedule the Pod |
| Run umount on the disk mount path on the node | Pods write to the local node disk | Remount the corresponding directory into the Pod |
| Operate the EVS disk directly on the node | Pods write to the local node disk | - |
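The log table's remark about the pos directory can be made concrete: a collector persists a per-file byte offset and resumes from it, so deleting that state forces re-reading every file from byte zero. A toy illustration of the mechanism (this is not the actual ccs-log-collector file format, which is not documented here):

```shell
# Toy model of offset-based log collection: a "pos" file stores how many
# bytes of the log have already been shipped.
LOG=$(mktemp); POS=$(mktemp)
printf 'line1\nline2\n' >> "$LOG"
echo 0 > "$POS"
collect() {                        # emit unread bytes, then persist the offset
  off=$(cat "$POS")
  tail -c +"$((off + 1))" "$LOG"   # tail -c +N starts at byte N (1-based)
  wc -c < "$LOG" > "$POS"
}
first=$(collect)                   # emits line1 and line2
printf 'line3\n' >> "$LOG"
second=$(collect)                  # emits only line3: old bytes are skipped
echo 0 > "$POS"                    # simulate deleting the pos state
third=$(collect)                   # emits everything again: duplicate collection
echo "after pos reset: $(printf '%s\n' "$third" | wc -l) lines re-collected"
```

The buffer directory is the mirror-image failure: it holds data that has been read but not yet delivered, so deleting it loses logs instead of duplicating them.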


Origin blog.csdn.net/KH_FC/article/details/111468242