Background
We received reports of abnormal DNS resolution. Running dig manually from a client timed out:
dig test.com @ip
# output partially omitted
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
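After a while dig recovered on its own. The DNS service had not been restarted during the outage, so I suspected a problem on the public network rather than in the DNS server program itself. I therefore proposed that the next time the failure occurred, we log in to the server and run dig against 127.0.0.1 to see whether resolution worked locally.
Unexpectedly, the problem came back that very night. I logged in to the server right away and ran dig, confident that nothing would go wrong locally. I was proven wrong, and for a moment I panicked.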
dig test.com @127.0.0.1
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
; <<>> DiG 9.10.6 <<>> test.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59732
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;test.com. IN A
;; ANSWER SECTION:
test.com. 3600 IN A 127.0.0.1
;; Query time: 16 msec
;; SERVER: 192.168.50.1#53(192.168.50.1)
;; WHEN: Mon Apr 03 23:36:57 CST 2023
;; MSG SIZE rcvd: 42
The symptom: dig did not respond promptly. It produced output only after a pause, and two errors were printed before the normal output:
internal_send: 127.0.0.1#53: Invalid argument
Check the kernel log:
dmesg | head -10
[529571.395313] neighbour: arp_cache: neighbor table overflow!
[529571.416911] neighbour: arp_cache: neighbor table overflow!
[529571.416915] neighbour: arp_cache: neighbor table overflow!
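To judge how often the overflow is happening, rather than eyeballing the first ten lines, the messages can be counted. The following is a minimal sketch; the function name count_neigh_overflow is hypothetical, and it assumes the kernel message text matches what is shown above.

```shell
# Count "neighbor table overflow" lines in kernel log text read from stdin.
# grep -c prints the number of matching lines; "|| true" keeps the exit
# status zero when there are no matches (grep exits 1 in that case).
count_neigh_overflow() {
    grep -c 'neighbor table overflow' || true
}

# Typical use on a live host (reading dmesg may require root):
#   dmesg | count_neigh_overflow
```

A count that keeps growing between two runs suggests the table is still overflowing.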
Check the current number of ARP entries:
arp -an | wc -l
796
Check the ARP gc thresholds:
sysctl -a | grep gc_thresh
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
net.ipv6.neigh.default.gc_thresh1 = 128
net.ipv6.neigh.default.gc_thresh2 = 512
net.ipv6.neigh.default.gc_thresh3 = 1024
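gc_thresh1: when the number of entries is below this value, the garbage collector (gc) does not run.
gc_thresh2: soft limit on the number of ARP table entries; the table may exceed it for up to 5 seconds.
gc_thresh3: hard limit on the number of ARP table entries; once it is exceeded, gc starts immediately and forcibly reclaims entries.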
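To relate an entry count to the three thresholds, a small helper can classify it. This is only a sketch of the simplified description above, not the kernel's actual algorithm; the function name neigh_gc_state and its messages are hypothetical.

```shell
# Classify an entry count against the three gc thresholds, following the
# simplified description in the text (not the kernel's exact logic).
# Usage: neigh_gc_state COUNT THRESH1 THRESH2 THRESH3
neigh_gc_state() {
    count=$1
    t1=$2
    t2=$3
    t3=$4
    if [ "$count" -lt "$t1" ]; then
        echo "below gc_thresh1: gc does not run"
    elif [ "$count" -le "$t2" ]; then
        echo "within soft limit"
    elif [ "$count" -le "$t3" ]; then
        echo "above gc_thresh2: soft limit exceeded"
    else
        echo "above gc_thresh3: hard limit exceeded"
    fi
}
```

With the numbers from this incident, neigh_gc_state 796 128 512 1024 prints "above gc_thresh2: soft limit exceeded".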
Solution
Adjust the node's kernel parameters to raise the ARP cache gc thresholds (in /etc/sysctl.conf):
# Setup DNS threshold for arp
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh1 = 1024
Run sysctl -p to apply the change.
References
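After applying the change, it may be worth confirming that the three IPv4 values are present and still ordered thresh1 < thresh2 < thresh3. The check below is a sketch: the function name check_gc_thresh_order is hypothetical, and the ordering rule is a sanity check assumed here, not something the source or the kernel mandates.

```shell
# Read "key = value" lines (sysctl output or sysctl.conf content) from stdin
# and succeed only if the IPv4 neighbour thresholds are all present and
# satisfy gc_thresh1 < gc_thresh2 < gc_thresh3 (an assumed sanity rule).
check_gc_thresh_order() {
    awk -F'[ \t]*=[ \t]*' '
        $1 == "net.ipv4.neigh.default.gc_thresh1" { t1 = $2 + 0 }
        $1 == "net.ipv4.neigh.default.gc_thresh2" { t2 = $2 + 0 }
        $1 == "net.ipv4.neigh.default.gc_thresh3" { t3 = $2 + 0 }
        END { exit !(t1 > 0 && t1 < t2 && t2 < t3) }
    '
}

# Typical use on a live host:
#   sysctl -a 2>/dev/null | grep gc_thresh | check_gc_thresh_order && echo ok
```

The function exits with status 0 when the values are ordered and non-zero otherwise, so it can be chained with && or used in an if statement.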
https://www.cnblogs.com/tencent-cloud-native/p/14481570.html