Background

We received a report that DNS resolution was failing. Running dig manually from a client timed out:

dig test.com @ip  # output partially omitted
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

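After a while dig recovered on its own. The DNS service had not been restarted during the outage, so I suspected a problem on the public network rather than in the DNS server program itself. I therefore proposed that if the fault recurred, we would log in to the server and run dig against 127.0.0.1 to see whether resolution still worked locally.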

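Unexpectedly, the problem came back that very night. I logged in to the server and ran dig right away, confident that nothing would go wrong locally. I was proven wrong, and for a moment I panicked.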

dig test.com @127.0.0.1
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
; <<>> DiG 9.10.6 <<>> test.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59732
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;test.com.			IN	A
;; ANSWER SECTION:
test.com.		3600	IN	A	127.0.0.1
;; Query time: 16 msec
;; SERVER: 192.168.50.1#53(192.168.50.1)
;; WHEN: Mon Apr 03 23:36:57 CST 2023
;; MSG SIZE  rcvd: 42

The symptom: dig did not respond promptly. Output appeared only after a short wait, and two errors were printed before the normal output:
internal_send: 127.0.0.1#53: Invalid argument

Check the kernel log

dmesg | head -10
[529571.395313] neighbour: arp_cache: neighbor table overflow!
[529571.416911] neighbour: arp_cache: neighbor table overflow!
[529571.416915] neighbour: arp_cache: neighbor table overflow!

Check the current number of ARP entries

arp -an | wc -l
796
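A raw count does not show how many of those entries are still in use. As a rough sketch, the entries can be tallied by state. The helper name neigh_states is hypothetical, and the sketch assumes that `ip neigh show` prints the neighbor state (REACHABLE, STALE, FAILED and so on) as the last field of each line:

```shell
#!/bin/sh
# neigh_states: read `ip neigh show`-style lines on stdin and print
# "STATE COUNT" pairs, one per line, sorted by state name.
# Assumption: the state is the last whitespace-separated field.
neigh_states() {
    awk 'NF > 0 { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Intended use on the affected server (not run here):
#   ip -4 neigh show | neigh_states
```

A large share of FAILED or INCOMPLETE entries would suggest that the table is being filled by lookups for addresses that never answer, rather than by real peers.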

Check the ARP gc thresholds

sysctl -a | grep gc_thresh
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
net.ipv6.neigh.default.gc_thresh1 = 128
net.ipv6.neigh.default.gc_thresh2 = 512
net.ipv6.neigh.default.gc_thresh3 = 1024

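gc_thresh1: below this number of entries, gc does not run
gc_thresh2: soft limit on the number of ARP table entries; it may be exceeded for up to 5 seconds
gc_thresh3: hard limit on the number of ARP table entries; once it is exceeded, gc runs immediately and forcibly reclaims entries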
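To make the three thresholds concrete, here is a minimal sketch that places an entry count into one of the bands described above. The function name classify_neigh and its labels are made up for illustration; it only compares numbers and does not reproduce the kernel's actual gc logic:

```shell
#!/bin/sh
# classify_neigh COUNT THRESH1 THRESH2 THRESH3
# Prints which band COUNT falls into, following the descriptions above.
classify_neigh() {
    count=$1; t1=$2; t2=$3; t3=$4
    if [ "$count" -ge "$t3" ]; then
        echo "at-or-over-hard-limit"   # gc_thresh3 reached
    elif [ "$count" -ge "$t2" ]; then
        echo "over-soft-limit"         # tolerated only briefly
    elif [ "$count" -ge "$t1" ]; then
        echo "gc-may-run"              # above gc_thresh1
    else
        echo "below-gc-threshold"      # gc does not run
    fi
}

# The numbers from this incident: 796 entries against 128/512/1024.
classify_neigh 796 128 512 1024   # prints: over-soft-limit
```

With the default thresholds, the 796 entries counted here were already past the soft limit of 512 and not far from the hard limit of 1024.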

Solution

Adjust the node's kernel parameters to raise the ARP cache gc thresholds (/etc/sysctl.conf):

# Setup DNS threshold for arp
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh1 = 1024

Run sysctl -p to apply the change.
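Before reloading, it can help to sanity-check the file. The sketch below reads the three IPv4 values back from a sysctl-style file and checks that gc_thresh1 < gc_thresh2 < gc_thresh3. The helper names get_thresh and check_thresh_order are hypothetical, and the ordering rule is my assumption about what a sensible configuration looks like, not something the kernel is known to enforce:

```shell
#!/bin/sh
# get_thresh FILE N: print the value of net.ipv4.neigh.default.gc_threshN
# from FILE (the last assignment wins; prints nothing if the key is absent).
get_thresh() {
    awk -F= -v want="net.ipv4.neigh.default.gc_thresh$2" '
        { key = $1; gsub(/[ \t]/, "", key) }
        key == want { val = $2; gsub(/[ \t]/, "", val) }
        END { print val }
    ' "$1"
}

# check_thresh_order FILE: succeed only if all three values are present
# and strictly increasing (gc_thresh1 < gc_thresh2 < gc_thresh3).
check_thresh_order() {
    t1=$(get_thresh "$1" 1)
    t2=$(get_thresh "$1" 2)
    t3=$(get_thresh "$1" 3)
    [ -n "$t1" ] && [ -n "$t2" ] && [ -n "$t3" ] || return 1
    [ "$t1" -lt "$t2" ] && [ "$t2" -lt "$t3" ]
}

# Intended use (not run here):
#   check_thresh_order /etc/sysctl.conf && sysctl -p
```

After reloading, the live values can be read back with `sysctl net.ipv4.neigh.default.gc_thresh3` (and likewise for gc_thresh1 and gc_thresh2) to confirm that they took effect.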