A node in a Kubernetes cluster reports "Error querying BIRD: unable to connect to BIRDv4 socket"

1. Problem description

Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

Readiness probe failed: 2023-05-04 22:13:23.706 [INFO][224] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.0.145,192.168.0.233,172.26.32.235

2. Environment

Component      Version
Kubernetes     v1.24.2
Containerd     1.6.18
Linux Kernel   5.4

3. Analysis

3.1 Locating the cause

The calico-node Pod on one node of the Kubernetes cluster was unhealthy. Inspect the Pod's description:

kubectl describe pod calico-node-hd7wm -n kube-system

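The events show that calico/node's connection to the BIRDv4 socket was refused. Some users online report that the Calico parameter IP_AUTODETECTION_METHOD must be set to the name of the node's actual network interface, so check the configuration: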

            - name: CLUSTER_TYPE
              value: "k8s,bgp"
            # Auto-detect the BGP IP address.
            - name: IP
              value: "autodetect"
            - name: IP_AUTODETECTION_METHOD
              value: "interface=eth0"

The Calico configuration already names the actual interface. The interface details are:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.200  netmask 255.255.255.0  broadcast 192.168.0.255
        ether fa:16:3e:e9:41:0a  txqueuelen 1000  (Ethernet)
        RX packets 951363626  bytes 577280343840 (537.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 967287474  bytes 178201446365 (165.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
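
As a cross-check of the setting above, the interface named in IP_AUTODETECTION_METHOD can be compared against what the node actually has. The following is a minimal sketch and not part of the original troubleshooting: the helper names iface_from_method and iface_exists are made up for illustration, the check relies on the Linux-specific /sys/class/net directory, and it only handles a single literal interface name (Calico also accepts regular expressions in this value, which the sketch does not try to evaluate).

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# iface_from_method VALUE: print the interface name from a value such as
# "interface=eth0"; print nothing for any other method (e.g. "first-found").
iface_from_method() {
    case "$1" in
        interface=*) printf '%s\n' "${1#interface=}" ;;
    esac
}

# iface_exists NAME: succeed if an interface with exactly this name is
# present on the host. Assumes Linux (/sys/class/net).
iface_exists() {
    [ -n "$1" ] && [ -e "/sys/class/net/$1" ]
}
```

Run on the affected node, `iface_exists "$(iface_from_method 'interface=eth0')" && echo present` would confirm that eth0 is there, which matches what the ifconfig output shows.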

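Checking for calico-node's bird process on the node shows that it is running, so the working guess is that the process has hung: it is alive but no longer answering. For more on the bird process, see: 基于 BGP 实现 Calico 的 IPIP 网络 (Implementing Calico's IPIP network with BGP)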

[root@k8s-master1 cni]# netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      2246613/bird        
......
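
The PID lookup above can be scripted so the number does not have to be copied by hand. This is a minimal sketch, assuming the netstat -ltnp column layout shown above (the last column is PID/program name). The function names bird_pids and bird_ctl_is_socket are hypothetical, and the default socket path is the one quoted in the error message in section 1. Note that the socket file merely existing does not prove bird is answering on it.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# bird_pids: read `netstat -ltnp` output on stdin and print the PID of every
# listener whose last column is exactly "<PID>/bird".
bird_pids() {
    awk '$NF ~ /^[0-9]+\/bird$/ { split($NF, a, "/"); print a[1] }'
}

# bird_ctl_is_socket [PATH]: succeed if PATH exists and is a unix socket.
# Defaults to the path from the readiness-probe error message.
bird_ctl_is_socket() {
    [ -S "${1:-/var/run/calico/bird.ctl}" ]
}
```

On the node this could be used as `netstat -ltnp | bird_pids`.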

3.2 Solution

  • Kill the bird process on the affected node so that calico-node automatically starts a new one. As shown above, the bird PID is 2246613:
kill -9 2246613
  • Delete the calico-node Pod on the affected node:
kubectl delete pod calico-node-hd7wm -n kube-system
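
The Pod deletion above can also be expressed by node name instead of by Pod name, which saves looking up the generated suffix. The sketch below is hedged in several ways: the function name restart_calico_node is made up, the label k8s-app=calico-node is an assumption based on the stock Calico manifests (check the labels on your own DaemonSet), and by default it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# restart_calico_node NODE: print the kubectl command that deletes the
# calico-node Pod scheduled on NODE, so the DaemonSet recreates it.
# Set DRY_RUN=0 to actually run the command instead of printing it.
restart_calico_node() {
    node=$1
    # Assumption: calico-node Pods carry the label k8s-app=calico-node.
    cmd="kubectl delete pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=$node"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$cmd"
    else
        $cmd
    fi
}
```

For example, `restart_calico_node k8s-master1` prints the command for review (the node name here is only a guess based on the shell prompt shown earlier), and `DRY_RUN=0 restart_calico_node k8s-master1` executes it.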

4. Conclusion

Check the status of calico-node:

kubectl get pods -A

The calico-node Pods are running as follows:

NAMESPACE              NAME                                        READY   STATUS    RESTARTS        AGE
kube-system            calico-node-9zhv2                           1/1     Running   5 (53d ago)     76d
kube-system            calico-node-dnvlc                           1/1     Running   0               4m1s
kube-system            calico-node-pt9qp                           1/1     Running   0               56d
kube-system            calico-node-wzq2p                           1/1     Running   0               56d
......
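
Scanning a long Pod list by eye is error-prone, so the same check can be filtered down to only the unhealthy calico-node Pods. This is a minimal sketch, assuming the column layout of kubectl get pods -A shown above (NAMESPACE, NAME, READY, STATUS, ...); the function name not_ready_calico is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# not_ready_calico: read `kubectl get pods -A` output on stdin and print the
# name of every calico-node Pod whose READY column (e.g. "0/1") shows fewer
# ready containers than total, or whose STATUS is not "Running".
not_ready_calico() {
    awk '$2 ~ /^calico-node-/ {
        split($3, r, "/")
        if (r[1] != r[2] || $4 != "Running") print $2
    }'
}
```

Used as `kubectl get pods -A | not_ready_calico`, empty output means every calico-node Pod in the list is ready.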


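All calico-node Pods are now healthy, and the Pod on the previously failing node is Running (presumably calico-node-dnvlc, the only Pod with an age of a few minutes). Next, check the bird process on that node: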

netstat -ltnp | grep bird

The bird process information is as follows:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      2253102/bird        
......

The bird process has been recreated; its new PID is 2253102. Killing the hung bird process so that a new one is started resolved the problem described above.


Reposted from blog.csdn.net/hzwy23/article/details/130498534