The mental journey of troubleshooting k8s and Calico failing to allocate pod IPs


Once again I quietly defused an accident that was waiting to happen. If you don't care about the process, skip to the end for the solution.

A network error



One day I built a test application on kplcloud. After the build completed, the new pod kept failing to start, throwing the following error:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "xxxxxx-fc4cb949f-gpkm2_xxxxxxx" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
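
A quick way to pull this message out of the cluster (a sketch assuming kubectl access; the namespace placeholder is hypothetical):

$ kubectl describe pod xxxxxx-fc4cb949f-gpkm2 -n <namespace>    # the Events section at the bottom carries the sandbox error
$ kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail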

The ops colleagues who know k8s were away, so what do you do when a problem suddenly appears?

Start working through it yourself.

1. Could the image pull have failed? Start looking for the problem (the checks are sketched as commands after this list):

  1. Log in to every server in the cluster and check whether disk space is full (it is not)

  2. Check the network status of every server in the cluster (no problems found)

  3. Try starting another pod (it won't start either)
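
A rough sketch of the three checks above, assuming SSH access to each node (the ping target just follows this cluster's masked IP pattern):

$ df -h                                                                # 1) is disk space full?
$ ping -c 3 10.xx.xx.2                                                 # 2) is the network between nodes healthy?
$ kubectl run nettest --image=busybox --restart=Never -- sleep 3600    # 3) does a throwaway pod start?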

This is awkward... could the problem be with calico?

2. Check the server error messages

Try the following command to see the error message from the server:

$ journalctl -exf

There were indeed some error messages, but they were too generic to act on, so keep looking for the problem elsewhere.
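
If the full journal is too noisy, it can be narrowed to the kubelet unit, which is where CNI errors usually surface (a sketch, assuming the kubelet runs as a systemd service named kubelet):

$ journalctl -u kubelet -f --since "10 minutes ago"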

At this point I was already starting to think about running away...

Would restarting be worth a try?

Too risky. A restart solves most problems in many cases, but restarting docker and k8s is not the best choice here.

Continuing through the logs, my guess was that an IP could not be assigned, so attention turned to calico.

Looking for problems in calico-node

First, check whether the IP pool is exhausted.

Use the calicoctl command to check whether calico is running normally:

$ calicoctl get ippools -o wide
CIDR NAT IPIP
172.20.0.0/16 true false

$ calicoctl node status

No problem there, it seems.
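
Another quick sanity check is the calico-node DaemonSet itself (a sketch; the k8s-app=calico-node label and the calico-node container name are the usual defaults but may differ per install):

$ kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
$ kubectl -n kube-system logs <calico-node-pod> -c calico-node --tail=50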

Started reaching out for outside help...

No luck.

Since calico-node was running normally, in theory calico-etcd should not be the problem either.

Try calico-etcd

Still, in the spirit of checking anything that looks even slightly suspicious, let's try some hands-on surgery on calico-etcd.

To keep the commands short and readable, the certificates and endpoints that etcdctl requires are not repeated below; the full form for reference is:
ETCDCTL_API=3 etcdctl --cacert=/etc/etcd/ssl/ca.pem \
--cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--endpoints=http://10.xx.xx.1:2379,http://10.xx.xx.2:2379,http://10.xx.xx.3:2379
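
For convenience, the flags above can be wrapped in a shell alias so shorter commands work as written (just a convenience sketch; the rest of the post omits the flags in the same spirit):

$ alias etcdctl3='ETCDCTL_API=3 etcdctl --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints=http://10.xx.xx.1:2379,http://10.xx.xx.2:2379,http://10.xx.xx.3:2379'
$ etcdctl3 member list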

With calico itself looking fine, check whether the etcd used by calico is healthy. Connect to the calico-etcd cluster:

$ ETCDCTL_API=3 etcdctl member list
bde98346d77cfa1: name=node-1 peerURLs=http://10.xx.xx.1:2380 clientURLs=http://10.xx.xx.1:2379 isLeader=true
299fcfbf514069ed: name=node-2 peerURLs=http://10.xx.xx.2:2380 clientURLs=http://10.xx.xx.2:2379 isLeader=false
954e5cdb2d25c491: name=node-3 peerURLs=http://10.xx.xx.3:2380 clientURLs=http://10.xx.xx.3:2379 isLeader=false

The cluster also appears to be running fine, and reading data works.

Everything seemed so normal; nothing appeared to be wrong.

Forget it, forget it, let me go draft a resume first and clear my head.

Then how about trying to write a piece of data into etcd?

$ ETCDCTL_API=3 etcdctl put /hello world

Error: etcdserver: mvcc: database space exceeded

✨ It threw an error:

Error: etcdserver: mvcc: database space exceeded???

It looks like the cause has been found. Now that the problem is located, the rest is easy. (No need to run away after all (⁎⁍̴̛ᴗ⁍̴̛⁎)) The resume can wait.

Thanks to Google, I found some clues and a solution on the official etcd site. I'll paste the official explanation later; first, fix the problem.

Use etcdctl endpoint status to check the usage status of each etcd node:

$ ETCDCTL_API=3 etcdctl endpoint status
http://10.xx.xx.1:2379, 299fcfbf514069ed, 3.2.18, 2.1 GB, false, 7, 8701663
http://10.xx.xx.2:2379, bde98346d77cfa1, 3.2.18, 2.1 GB, true, 7, 8701683
http://10.xx.xx.3:2379, 954e5cdb2d25c491, 3.2.18, 2.1 GB, false, 7, 8701687

The output shows the cluster has already used 2.1 GB of space; keep an eye on that number.

Check whether etcd has raised any alarms, using the etcdctl alarm list command:

$ ETCDCTL_API=3 etcdctl alarm list
memberID:2999344297460918765 alarm:NOSPACE

It shows an alarm: NOSPACE, meaning there is no space left. But no space where? Disk or memory? Check that first.

Disk and memory both look sufficient. According to the official docs this is an etcd quota issue: etcd v3's default backend quota is 2 GB, which is to say etcd's default maximum quota is 2 GB. Once it is exceeded, no more data can be written until old data is deleted or the history is compacted.
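
For reference, the quota itself is configurable at etcd startup via --quota-backend-bytes; raising it only buys time, compaction is still needed. A sketch of the flag (the 8 GB value is just an example, not from this cluster):

# added to the existing etcd command line / unit file
--quota-backend-bytes=8589934592    # raise the backend quota to 8 GB; the default is ~2 GB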

Follow the official solution

etcd documentation reference: https://etcd.io/docs/v3.2.17/op-guide/maintenance/ (a combined one-pass sketch of the steps appears after the list)

  1. Get etcd's current revision number

    $ ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
    5395771
    5395771
    5395771
  2. Compact away the old revisions

    $ ETCDCTL_API=3 etcdctl compact 5395771
    compacted revision 5395771
  3. Defragment

    $ ETCDCTL_API=3 etcdctl defrag
    Finished defragmenting etcd member[http://10.xx.xx.1:2379]
    Finished defragmenting etcd member[http://10.xx.xx.2:2379]
    Finished defragmenting etcd member[http://10.xx.xx.3:2379]
  4. Disarm the alarm

    $ ETCDCTL_API=3 etcdctl alarm disarm
    memberID:2999344297460918765 alarm:NOSPACE

    $ ETCDCTL_API=3 etcdctl alarm list
  5. Test whether data can be written

    $ ETCDCTL_API=3 etcdctl put /hello world
    OK

    $ ETCDCTL_API=3 etcdctl get /hello
    /hello
    world
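
For convenience, the same steps can be run as one pass (a sketch using the etcdctl flags shown earlier; the revision value will of course differ per cluster):

$ rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+' | head -n1)
$ ETCDCTL_API=3 etcdctl compact "$rev"
$ ETCDCTL_API=3 etcdctl defrag
$ ETCDCTL_API=3 etcdctl alarm disarm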

Back on the k8s side, delete the failed pod and check whether an IP can be allocated normally.
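
A sketch of that verification (the pod name is from the earlier error; the namespace placeholder is hypothetical):

$ kubectl delete pod xxxxxx-fc4cb949f-gpkm2 -n <namespace>
$ kubectl get pods -n <namespace> -o wide    # the replacement pod should now come up with a 172.20.x.x address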

Everything works. Perfect.

To avoid a recurrence of this kind of problem, automatic compaction needs to be set up; enabling it means adding xxxxx=1 to etcd's startup parameters (the flag presumably meant is sketched below).

https://skyao.gitbooks.io/learning-etcd3/content/documentation/op-guide/maintenance.html

By default etcd does not compact automatically; compaction must be enabled with a startup parameter or run via commands. If data changes frequently it is recommended to set it, otherwise space and memory are wasted and errors follow. Etcd v3's default backend quota is 2 GB; without compaction, once the boltdb file exceeds this limit etcd reports "Error: etcdserver: mvcc: database space exceeded" and data can no longer be written.
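
The startup parameter referred to above as xxxxx=1 is presumably etcd's --auto-compaction-retention flag (an assumption on my part; check the docs for your etcd version):

# added to etcd's startup parameters: keep roughly 1 hour of history and compact the rest automatically
--auto-compaction-retention=1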

The reason so much garbage data piled up is frequent scheduling: our cluster runs a large number of very active CronJobs, and every new Pod gets an IP allocated. Pods being short-lived, or not being deregistered in time, most likely caused calico-etcd to accumulate a large amount of garbage data.
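
One way to gauge how much of that data is sitting in calico-etcd is to count Calico's keys (a sketch assuming Calico's default /calico/ key prefix and the etcdctl flags shown earlier):

$ ETCDCTL_API=3 etcdctl get /calico/ --prefix --keys-only | grep -c .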

Wrap-up




Because the calico-etcd cluster's quota was exhausted, the IP that calico allocated when creating a pod could not be written to etcd; pod creation therefore failed, and the pod could never register with CoreDNS.

To stay out of pits like this, monitoring is essential. We had monitoring on the etcd cluster itself but overlooked monitoring of the etcd quota. Fortunately no application happened to be restarting or upgrading at the time, so no damage was done.
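
A cheap quota check that could be wired into existing monitoring (a sketch; the 2 GB threshold mirrors the default quota and should be adjusted if --quota-backend-bytes is set):

$ ETCDCTL_API=3 etcdctl endpoint status --write-out=json | egrep -o '"dbSize":[0-9]*'    # alert when this approaches 2147483648 bytes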

One final piece of advice: poke around your systems now and then even when nothing is wrong; you may find surprises (or scares) you did not expect.

