Problem description
calico-kube-controllers is unhealthy and keeps restarting.
The log output is as follows:
2023-02-21 01:26:47.085 [INFO][1] main.go 92: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0221 01:26:47.086980 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2023-02-21 01:26:47.087 [INFO][1] main.go 113: Ensuring Calico datastore is initialized
2023-02-21 01:26:47.106 [INFO][1] main.go 153: Getting initial config snapshot from datastore
2023-02-21 01:26:47.120 [INFO][1] main.go 156: Got initial config snapshot
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 89: Start called
2023-02-21 01:26:47.120 [INFO][1] main.go 173: Starting status report routine
2023-02-21 01:26:47.120 [INFO][1] main.go 182: Starting Prometheus metrics server on port 9094
2023-02-21 01:26:47.120 [INFO][1] main.go 418: Starting controller ControllerType="Node"
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2023-02-21 01:26:47.120 [INFO][1] node_syncer.go 65: Node controller syncer status updated: wait-for-ready
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2023-02-21 01:26:47.120 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2023-02-21 01:26:47.120 [INFO][1] node_controller.go 143: Starting Node controller
2023-02-21 01:26:47.121 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-02-21 01:26:47.121 [INFO][1] resources.go 349: Main client watcher loop
2023-02-21 01:26:47.121 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.121 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.121 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.121 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.124 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2023-02-21 01:26:47.125 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2023-02-21 01:26:47.125 [INFO][1] node_syncer.go 65: Node controller syncer status updated: resync
2023-02-21 01:26:47.125 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-02-21 01:26:47.125 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.125 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.129 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-02-21 01:26:47.129 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.129 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2023-02-21 01:26:47.129 [INFO][1] node_syncer.go 65: Node controller syncer status updated: in-sync
2023-02-21 01:26:47.137 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2023-02-21 01:26:47.221 [INFO][1] node_controller.go 159: Node controller is now running
2023-02-21 01:26:47.226 [INFO][1] ipam.go 69: Synchronizing IPAM data
2023-02-21 01:26:47.236 [INFO][1] ipam.go 78: Node and IPAM data is in sync
Pinpointing the problem
The culprit is this line:
Failed to write status error=open /status/status.json: permission denied
Because the controller cannot write /status/status.json, the liveness and readiness probes (/usr/bin/check-status), which read that file, keep failing, and kubelet keeps restarting the container.
Inspecting the directory inside the container
Tried to exec into the container, but the image surprisingly ships without common commands such as cat and ls, so the problem could not be examined from inside the container.
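When the image ships without a shell or coreutils, the directory can still be inspected from the node instead. A rough sketch, assuming the Docker runtime with the overlay2 storage driver (these commands and the <container-id> placeholder are illustrative, not from the original troubleshooting):
# On the node (10.254.39.2) where the pod is scheduled:
docker ps | grep calico-kube-controllers                                      # find the running container ID
docker inspect <container-id> --format '{{ .GraphDriver.Data.MergedDir }}'    # path of the container's merged rootfs
ls -ld "$(docker inspect <container-id> --format '{{ .GraphDriver.Data.MergedDir }}')/status"   # owner and mode of /status as the container sees it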
Checking the configuration
Reviewed the pod's configuration and compared it with the other clusters: there is no difference at all, it is identical.
[grg@i-A8259010 ~]$ kubectl describe pod calico-kube-controllers-9f49b98f6-njs2f -n kube-system
Name: calico-kube-controllers-9f49b98f6-njs2f
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: 10.254.39.2/10.254.39.2
Start Time: Thu, 16 Feb 2023 11:14:35 +0800
Labels: k8s-app=calico-kube-controllers
        pod-template-hash=9f49b98f6
Annotations: cni.projectcalico.org/podIP: 10.244.29.73/32
             cni.projectcalico.org/podIPs: 10.244.29.73/32
Status: Running
IP: 10.244.29.73
IPs:
  IP:  10.244.29.73
Controlled By: ReplicaSet/calico-kube-controllers-9f49b98f6
Containers:
  calico-kube-controllers:
    Container ID:   docker://21594e3517a3fc8ffc5224496cec373117138acf5417d9a335a1c5e80e0c3802
    Image:          registry.custom.local:12480/kubeadm-ha/calico_kube-controllers:v3.19.1
    Image ID:       docker-pullable://registry.cn-beijing.aliyuncs.com/dotbalo/kube-controllers@sha256:2ff71ba65cd7fe10e183ad80725ad3eafb59899d6f1b2610446b90c84bf2425a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 21 Feb 2023 09:34:06 +0800
      Finished:     Tue, 21 Feb 2023 09:35:15 +0800
    Ready:          False
    Restart Count:  1940
    Liveness:       exec [/usr/bin/check-status -l] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:      exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-55jbn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-55jbn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly op=Exists
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                         From     Message
  ----     ------     ----                        ----     -------
  Warning  Unhealthy  31m (x15164 over 4d22h)     kubelet  Readiness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
  Warning  BackOff    6m23s (x23547 over 4d22h)   kubelet  Back-off restarting failed container
  Warning  Unhealthy  79s (x11571 over 4d22h)     kubelet  Liveness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
Comparing images
Checked the image version; it is identical to the other clusters, so no problem there.
Image: registry.custom.local:12480/kubeadm-ha/calico_kube-controllers:v3.19.1
Image ID: docker-pullable://registry.cn-beijing.aliyuncs.com/dotbalo/kube-controllers@sha256:2ff71ba65cd7fe10e183ad80725ad3eafb59899d6f1b2610446b90c84bf2425a
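One way to run this comparison on each cluster is to read the image straight from the Deployment (a hedged one-liner; the jsonpath query is mine, not from the original notes):
kubectl -n kube-system get deployment calico-kube-controllers \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'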
Checking configuration differences against the other clusters
Compared this machine's setup with the other clusters: its Docker was installed previously and is version 19, while the other machines run a freshly installed version 20.
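To make the difference concrete, the daemon version can be read on each node (a small sketch; the output noted in the comment is assumed, not captured from the original session):
docker version --format 'Server: {{ .Server.Version }}'
# this node reports a 19.x release, while the freshly installed nodes report 20.x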
Resolution
Since reinstalling Docker was not an option:
Restarting the pod had no effect.
Searching Baidu turned up no relevant information.
Adjusting the calico-kube-controllers configuration
The configuration file is /etc/kubernetes/plugins/network-plugin/calico-typha.yaml.
To work around the unwritable /status directory, we add a volume mapping for it (sketched below).
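The exact edit was not recorded, but a minimal sketch of the addition to the calico-kube-controllers Deployment in that file looks like this, assuming a hostPath volume backed by /var/run/calico/status, the directory created and opened up on the node in the next step (indentation and surrounding fields must match the existing spec):
      containers:
        - name: calico-kube-controllers
          # ... existing image, env and probes stay unchanged ...
          volumeMounts:
            - name: status
              mountPath: /status
      volumes:
        - name: status
          hostPath:
            path: /var/run/calico/status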
Applying the configuration
mkdir /var/run/calico/status
chmod 777 /var/run/calico/status
kubectl apply -f /etc/kubernetes/plugins/network-plugin/calico-typha.yaml
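A quick way to confirm the rollout took effect (verification commands assumed, not part of the original notes):
kubectl -n kube-system get pod -l k8s-app=calico-kube-controllers      # pod should go Running and Ready
kubectl -n kube-system logs deploy/calico-kube-controllers --tail=20   # no more "permission denied" errors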
With this, the system recovered.