1. Problem Description
Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Readiness probe failed: 2023-05-04 22:13:23.706 [INFO][224] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.0.145,192.168.0.233,172.26.32.235
2. Environment
Component | Version |
---|---|
Kubernetes | v1.24.2 |
Containerd | 1.6.18 |
Linux Kernel | 5.4 |
3. Problem Analysis
3.1 Locating the Cause
One node in the Kubernetes cluster has an abnormal calico-node Pod. Check that Pod's description:
kubectl describe pod calico-node-hd7wm -n kube-system
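The events show that calico/node was refused when connecting to the BIRDv4 socket. Some users report that the Calico parameter IP_AUTODETECTION_METHOD must be set to the name of the node's actual network interface, so check the configuration: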
- name: CLUSTER_TYPE
  value: "k8s,bgp"
# Auto-detect the BGP IP address.
- name: IP
  value: "autodetect"
- name: IP_AUTODETECTION_METHOD
  value: "interface=eth0"
The Calico configuration already names the actual interface. The interface details are:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.200  netmask 255.255.255.0  broadcast 192.168.0.255
        ether fa:16:3e:e9:41:0a  txqueuelen 1000  (Ethernet)
        RX packets 951363626  bytes 577280343840 (537.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 967287474  bytes 178201446365 (165.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
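The configuration check above can also be done from the command line instead of reading the manifest. The sketch below is a hedged example: the function name `calico_autodetect_method` is made up for illustration, and it assumes the DaemonSet is called `calico-node` in the `kube-system` namespace, as in the standard Calico manifests. It relies on `kubectl set env --list`, which prints the environment as `NAME=value` lines.

```shell
# Print the IP_AUTODETECTION_METHOD value configured on the calico-node DaemonSet.
# Assumptions (not from the original article): DaemonSet name "calico-node" and
# namespace "kube-system"; both can be overridden by arguments.
calico_autodetect_method() {
  local ds="${1:-calico-node}"
  local ns="${2:-kube-system}"
  # --list prints each env var as NAME=value; keep only the one we care about
  # and strip the "NAME=" prefix. Prints nothing if the variable is unset.
  kubectl set env "daemonset/${ds}" -n "${ns}" --list \
    | sed -n 's/^IP_AUTODETECTION_METHOD=//p'
}
```

Calling `calico_autodetect_method` with no arguments should print a value such as `interface=eth0`, which can then be compared against the interface names reported by `ifconfig` or `ip addr` on the node.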
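Next, check the bird process that calico-node runs on the node. The process is up and listening, yet connections to its control socket are refused, so the likely explanation is that the process has hung (alive but unresponsive). For more on the bird process, see the article "Implementing Calico's IPIP Network with BGP" (基于 BGP 实现 Calico 的 IPIP 网络).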
[root@k8s-master1 cni]# netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 2246613/bird
......
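One way to test the "hung process" guess is to query the same control socket that the readiness probe uses. The sketch below is an assumption-laden example: the function name `bird_socket_status` is hypothetical, it assumes the container is named `calico-node`, and it assumes the `birdcl` client is present in the calico/node image. The socket path is the one shown in the error message.

```shell
# Report whether BIRD answers on its control socket inside a calico-node Pod.
# Assumptions: container name "calico-node" and a birdcl binary in the image.
# Note: any failure of "kubectl exec" itself (for example, a Pod that is not
# running) is also reported as "unresponsive".
bird_socket_status() {
  local pod="$1"
  local ns="${2:-kube-system}"
  if kubectl exec -n "${ns}" "${pod}" -c calico-node -- \
       birdcl -s /var/run/calico/bird.ctl show protocols >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "unresponsive"
  fi
}
```

For the Pod in this article, `bird_socket_status calico-node-hd7wm` printing `unresponsive` while `netstat` still shows bird listening on port 179 would be consistent with a hung process.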
3.2 Solution
- Kill the bird process on the affected node so that calico-node automatically starts a new one. As shown above, the bird PID is 2246613.
kill -9 2246613
- Delete the calico-node Pod on the affected node
kubectl delete pod calico-node-hd7wm -n kube-system
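The two steps above use a hard-coded PID and Pod name. As a rough sketch, both can be looked up instead. The helper names `bird_pid` and `calico_pod_on_node` are hypothetical, and the label `k8s-app=calico-node` is an assumption based on the standard Calico manifests.

```shell
# Read "netstat -ltnp" output on stdin and print the PID of the bird process
# listening on the BGP port (179). The last column has the form "PID/program".
bird_pid() {
  awk '$4 ~ /:179$/ && $NF ~ /^[0-9]+\/bird$/ { split($NF, a, "/"); print a[1] }' \
    | sort -u
}

# Print the calico-node Pod scheduled on a given node, as "pod/<name>".
# Assumption: the Pods carry the label k8s-app=calico-node.
calico_pod_on_node() {
  local node="$1"
  local ns="${2:-kube-system}"
  kubectl get pods -n "${ns}" -l k8s-app=calico-node \
    --field-selector "spec.nodeName=${node}" -o name
}
```

On the affected node, `netstat -ltnp | bird_pid` prints the PID to pass to `kill`. From a machine with cluster access, `calico_pod_on_node <node-name>` prints the Pod to pass to `kubectl delete -n kube-system`.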
4. Conclusion
Check the running status of calico-node:
kubectl get pods -A
The calico-node status is as follows:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-9zhv2 1/1 Running 5 (53d ago) 76d
kube-system calico-node-dnvlc 1/1 Running 0 4m1s
kube-system calico-node-pt9qp 1/1 Running 0 56d
kube-system calico-node-wzq2p 1/1 Running 0 56d
......
All calico-node Pods are now healthy, and the Pod on the previously failing node is in the Running state. Check the state of the bird process on that node:
netstat -ltnp | grep bird
The bird process information is as follows:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 2253102/bird
......
The bird process has been recreated, and its new PID is 2253102. Killing the hung bird process so that a fresh one was started resolved the problem.
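The status check in this section can be scripted rather than read by eye. The following is a minimal sketch, with a hypothetical function name `unready_calico_nodes`, that reads `kubectl get pods -A` output on stdin and prints any calico-node Pod that is not fully ready. It assumes the default column order shown above (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Read "kubectl get pods -A" output on stdin and print the names of calico-node
# Pods whose READY count is incomplete or whose STATUS is not "Running".
unready_calico_nodes() {
  awk '$2 ~ /^calico-node-/ {
         split($3, ready, "/")            # READY column, e.g. "1/1"
         if (ready[1] != ready[2] || $4 != "Running") print $2
       }'
}
```

Running `kubectl get pods -A | unready_calico_nodes` prints nothing when every calico-node Pod is ready, which matches the output listed in this section.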