一、背景
做为一个DBA,在普天同乐的国庆节期间,还在像个值班老大爷一样,苦逼的值守着数据中心,这可能是国庆节最惨的事情了。今天在值班空闲之余,正好想起在前几天遇到的一个奇怪的redis问题,顺便就记录一下,如果大家能看到也可以为大家做个避坑参考。
为了较好满足我们的同城双活架构,公司的Redis缓存,主要采取的架构即为1主2从3哨兵的模式。在这种架构下,我们将主节点放置到双活中心的主站点,两个从节点则放置到双活中心的备站点。
这样安排则可以达到两个效果:
1.主站点的redis服务器节点出现异常(或者整个主站点出现异常),则由于其中2个哨兵均在备站点,处于1半以上的哨兵存活。此时,哨兵则会投票将两个从节点选出一个主并且进行拉起,redis可正常对外提供服务;
2.被站点的任意一台redis服务器节点出现异常(或者整个备站点出现异常),则由于主站点存活,则redis不会存在任何动作,redis可正常对外提供服务。
二、问题
为了保证后期的统一监控以及自动化运维的方便,我们通过制定标准化参数(最佳实践参数),对整个Redis的哨兵模式进行标准化安装,具体参数如下:
主redis.conf:
bind 0.0.0.0
port 12345
logfile "/app/redis/redis/etc/redis.log"
dir "/app/redis/redis/data"
requirepass "Redis"
masterauth "Redis"
daemonize yes
protected-mode yes
timeout 1200
save 3600 1
save 300 100
save 60 10000
stop-writes-on-bgsave-error yes
io-threads 2
maxmemory 6gb
maxmemory-policy volatile-lfu
maxclients 102400lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
replica-lazy-flush yes
hz 10
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000#safe-config
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command KEYS ""
从redis.conf
bind 0.0.0.0
port 12345
logfile "/app/redis/redis/etc/redis.log"
dir "/app/redis/redis/data"
requirepass "Redis"
masterauth "Redis"
daemonize yes
protected-mode yestimeout 1200
save 3600 1
save 300 100
save 60 10000
stop-writes-on-bgsave-error yes
io-threads 2
maxmemory 6gb
maxmemory-policy volatile-lfu
maxclients 102400lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
replica-lazy-flush yes
hz 10
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000replicaof redis1 12345#safe-config
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command KEYS ""
sentinel.conf
port 13245
bind 0.0.0.0
protected-mode yes
daemonize yes
logfile "/app/redis/redis/etc/sentinel.log"
sentinel monitor mymaster redis1 12345 2
sentinel auth-pass mymaster Redis
sentinel down-after-milliseconds mymaster 3000
sentinel resolve-hostnames yes
redis运行状态:
[webapp@redis1 etc]$ redis-cli -h redis1 -p 12345
redis1:12345>
redis1:12345>
redis1:12345> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.248.137,port=12345,state=online,offset=104681,lag=1
slave1:ip=192.168.248.138,port=12345,state=online,offset=104681,lag=1
master_failover_state:no-failover
master_replid:fe9068494fc08cf4285db2fd3aa9235f23546c4c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:105119
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:105119
可以看到此时redis状态为1主两从。
sentinel运行状态:
[webapp@redis1 etc]$ redis-cli -h redis1 -p 13245
redis1:13245> info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=192.168.248.136:12345,slaves=2,sentinels=3
可以看到,此时有3个哨兵,其中192.168.248.136,即redis1为sentinel_master,slave有2个。
sentinel角色如下:
sentinel1------a562f0bb780a80b0db4c460c79f440b37441a64d
sentinel2------bb789141efebae52764b204f918c5b9f44d85347
sentinel3------3cfeb7cf40f21e21d61c68f1d92ca8bb01d1a1b1
为了保证系统在上线前,确保当出现主备节点出现异常的情况时,无需人工,redis可以自动切换,恢复服务,故需要进行高可用切换测试。
而就是在高可用测试过程中出现了问题!
为了模拟主站点机器出现异常,导致redis的主服务出现问题,直接将主节点redis服务杀掉,包括redis实例和sentinel服务,然后在2节点进行日志查看:
ps -ef|grep redis|grep -v grep|awk '{print $2}'|xargs kill -9
本来以为很顺利,但是却哨兵居然无法正常对redis进行切换,具体sentinel日志如下:
2003:X 04 Oct 2022 18:40:31.089 # +sdown master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.089 # +sdown sentinel a562f0bb780a80b0db4c460c79f440b37441a64d 192.168.248.136 13245 @ mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.191 # +odown master mymaster 192.168.248.136 12345 #quorum 2/2
2003:X 04 Oct 2022 18:40:31.191 # +new-epoch 1
2003:X 04 Oct 2022 18:40:31.191 # +try-failover master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.193 # +vote-for-leader bb789141efebae52764b204f918c5b9f44d85347 1
2003:X 04 Oct 2022 18:40:31.202 # 3cfeb7cf40f21e21d61c68f1d92ca8bb01d1a1b1 voted for bb789141efebae52764b204f918c5b9f44d85347 1
2003:X 04 Oct 2022 18:40:31.256 # +elected-leader master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.256 # +failover-state-select-slave master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.318 # +selected-slave slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.318 * +failover-state-send-slaveof-noone slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.381 * +failover-state-wait-promotion slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 123452003:X 04 Oct 2022 18:43:31.424 # -failover-abort-slave-timeout master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:43:31.483 # Next failover delay: I will not start a failover before Tue Oct 4 18:46:31 2022
从这里可以看到sentinel由于failover超时,导致切换延迟,并告知在几点之后进行下一次failover。
三、问题分析
为什么redis哨兵不进行切换呢?是因为哨兵没有正常工作么?通过对sentinel.log的日志进行分析,可以看出哨兵的大致工作流程如下:
2003:X 04 Oct 2022 18:40:31.089 # +sdown master mymaster 192.168.248.136 12345
#表示本节点识别到master发生了宕机;
2003:X 04 Oct 2022 18:40:31.089 # +sdown sentinel a562f0bb780a80b0db4c460c79f440b37441a64d 192.168.248.136 13245 @ mymaster 192.168.248.136 12345
#表示本节点识别到了sentinel1发生了宕机;
2003:X 04 Oct 2022 18:40:31.191 # +odown master mymaster 192.168.248.136 12345 #quorum 2/2
#表示剩余的两个哨兵均识别到了master发生宕机;
2003:X 04 Oct 2022 18:40:31.191 # +new-epoch 1
#表示进行第1轮切换;
2003:X 04 Oct 2022 18:40:31.191 # +try-failover master mymaster 192.168.248.136 12345
#表示哨兵尝试着对主节点进行failover操作;
2003:X 04 Oct 2022 18:40:31.193 # +vote-for-leader bb789141efebae52764b204f918c5b9f44d85347 1
#由于主哨兵也发生宕机,表示本节点选取sentinel2(本节点)为leader;
2003:X 04 Oct 2022 18:40:31.202 # 3cfeb7cf40f21e21d61c68f1d92ca8bb01d1a1b1 voted for bb789141efebae52764b204f918c5b9f44d85347 1
#表示sentinel3也投票给sentinel2为leader;
2003:X 04 Oct 2022 18:40:31.256 # +elected-leader master mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.256 # +failover-state-select-slave master mymaster 192.168.248.136 12345
#尝试将redis1切换为从节点状态;
2003:X 04 Oct 2022 18:40:31.318 # +selected-slave slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.318 * +failover-state-send-slaveof-noone slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 12345
2003:X 04 Oct 2022 18:40:31.381 * +failover-state-wait-promotion slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.136 12345
#尝试将redis2切换为主节点;
2003:X 04 Oct 2022 18:43:31.424 # -failover-abort-slave-timeout master mymaster 192.168.248.136 12345
#将redis1切换为从节点由于超时失败;
2003:X 04 Oct 2022 18:43:31.483 # Next failover delay: I will not start a failover before Tue Oct 4 18:46:31 2022
#提示在下一个时间点,再次进行failover切换。
在sentinel的解读中,就可以发现明显的问题,即哨兵在切换节点的时候发生了超时,因此无法完成切换动作。因此,为什么会导致sentinel切换发生超时则是问题的关键。
四、问题解决及问题根因
经过一系列的资料查询,询问专家,找到了无法切换的众多因素,包括:
1)配置中没有写sentinel monitor mymaster、sentinel auth-pass mymaster的配置,导致sentinel无法登录主从,从而切换超时失败;
2)使用了protected-mode yes,但没有写bind,导致只允许本地登录,sentinel节点无法登录,从而无法切换;
3)配置中些了域名(或/etc/hosts),导致无法解析;
但是经过参数修改和测试,这些原因都被排除。
不过从sentinel的日志中其实可以注意到,出现问题的是redis实例,而sentinel实例则是正常。于是重点需要排查redis实例的参数配置。故逐一对配置中的参数进行注释并尝试切换。直到发现一个安全设置参数:
rename-command CONFIG ""
由于为了标准化并固定配置,希望一次性将配置全部写好,并不允许任何人修改或误操作故加入了几个重命名命令的操作:
#safe-config
rename-command FLUSHALL "" ----将FLUSHALL命令失效,防止人为执行FLUSHALL将redis删库
rename-command FLUSHDB "" ----将FLUSHDB失效,防止人为执行FLUSHDB将redis删库
rename-command CONFIG "" ----将CONFIG失效,防止人为修改参数
rename-command KEYS "" ----将KEYS失效,防止人为扫描所有键
而问题恰恰出现在这个标准化配置上。
我们将rename-command CONFIG ""进行删除,使redis.conf的配置如下:
主redis配置:
bind 0.0.0.0
port 12345
logfile "/app/redis/redis/etc/redis.log"
dir "/app/redis/redis/data"
requirepass "Redis"
masterauth "Redis"
daemonize yes
protected-mode yes
timeout 1200
save 3600 1
save 300 100
save 60 10000
stop-writes-on-bgsave-error yes
io-threads 2
maxmemory 6gb
maxmemory-policy volatile-lfu
maxclients 102400lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
replica-lazy-flush yes
hz 10
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000#safe-config
rename-command FLUSHALL ""
rename-command FLUSHDB ""
从redis配置:
bind 0.0.0.0
port 12345
logfile "/app/redis/redis/etc/redis.log"
dir "/app/redis/redis/data"
requirepass "Redis"
masterauth "Redis"
daemonize yes
protected-mode yestimeout 1200
save 3600 1
save 300 100
save 60 10000
stop-writes-on-bgsave-error yes
io-threads 2
maxmemory 6gb
maxmemory-policy volatile-lfu
maxclients 102400lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
replica-lazy-flush yes
hz 10
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000replicaof redis1 12345#safe-config
rename-command FLUSHALL ""
rename-command FLUSHDB ""
重启所有redis,并进行切换测试,此时发现sentinel的日志如下:
2444:X 05 Oct 2022 00:13:35.774 # +sdown master mymaster 192.168.248.136 12345
2444:X 05 Oct 2022 00:13:35.775 # +sdown sentinel a562f0bb780a80b0db4c460c79f440b37441a64d 192.168.248.136 13245 @ mymaster 192.168.248.136 12345
2444:X 05 Oct 2022 00:13:35.926 # +new-epoch 6
2444:X 05 Oct 2022 00:13:35.927 # +vote-for-leader 3cfeb7cf40f21e21d61c68f1d92ca8bb01d1a1b1 6
2444:X 05 Oct 2022 00:13:36.327 # +config-update-from sentinel 3cfeb7cf40f21e21d61c68f1d92ca8bb01d1a1b1 192.168.248.138 13245 @ mymaster 192.168.248.136 12345
2444:X 05 Oct 2022 00:13:36.327 # +switch-master mymaster 192.168.248.136 12345 192.168.248.137 12345
2444:X 05 Oct 2022 00:13:36.328 * +slave slave 192.168.248.138:12345 192.168.248.138 12345 @ mymaster 192.168.248.137 12345
2444:X 05 Oct 2022 00:13:36.328 * +slave slave 192.168.248.136:12345 192.168.248.136 12345 @ mymaster 192.168.248.137 12345
2444:X 05 Oct 2022 00:13:39.377 # +sdown slave 192.168.248.136:12345 192.168.248.136 12345 @ mymaster 192.168.248.137 12345
从上述日志中的+switch-master mymaster部分可以看到,此时master已经有136切换到137了,而136则被切换为了slave,且所有节点发现slave是down的。
而从redis2的实例日志中查看,则看到以下日志:
2437:S 05 Oct 2022 00:13:32.748 # Connection with master lost.
2437:S 05 Oct 2022 00:13:32.748 * Caching the disconnected master state.
2437:S 05 Oct 2022 00:13:32.748 * Reconnecting to MASTER redis1:12345
2437:S 05 Oct 2022 00:13:32.749 * MASTER <-> REPLICA sync started
2437:S 05 Oct 2022 00:13:32.749 # Error condition on socket for SYNC: Connection refused
2437:S 05 Oct 2022 00:13:33.712 * Connecting to MASTER redis1:12345
2437:S 05 Oct 2022 00:13:33.712 * MASTER <-> REPLICA sync started
2437:S 05 Oct 2022 00:13:33.713 # Error condition on socket for SYNC: Connection refused
2437:S 05 Oct 2022 00:13:34.720 * Connecting to MASTER redis1:12345
2437:S 05 Oct 2022 00:13:34.720 * MASTER <-> REPLICA sync started
2437:S 05 Oct 2022 00:13:34.720 # Error condition on socket for SYNC: Connection refused
2437:S 05 Oct 2022 00:13:35.726 * Connecting to MASTER redis1:12345
2437:S 05 Oct 2022 00:13:35.726 * MASTER <-> REPLICA sync started
2437:S 05 Oct 2022 00:13:35.727 # Error condition on socket for SYNC: Connection refused
2437:M 05 Oct 2022 00:13:36.167 * Discarding previously cached master state.
2437:M 05 Oct 2022 00:13:36.167 # Setting secondary replication ID to d5b7a91844ffec06ab3f9273051224b142fa3aea, valid up to offset: 62946. New replication ID is 3bbca1d4496bbf6d16978ff9d030bdce7f32db51
2437:M 05 Oct 2022 00:13:36.167 * MASTER MODE enabled (user request from 'id=9 addr=192.168.248.138:35899 laddr=192.168.248.137:12345 fd=12 name=sentinel-3cfeb7cf-cmd age=293 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=188 qbuf-free=40766 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
2437:M 05 Oct 2022 00:13:36.168 # CONFIG REWRITE executed with success.
2437:M 05 Oct 2022 00:13:36.330 * Replica 192.168.248.138:12345 asks for synchronization
2437:M 05 Oct 2022 00:13:36.330 * Partial resynchronization request from 192.168.248.138:12345 accepted. Sending 169 bytes of backlog starting from offset 62946.
从上述日志中可以看到,在00:13:32的时候,该节点联系不上主机点,并联系报错无法进行主从同步,而在00:13:36的时候该节点责备切换为主节点。之后,192.168.248.138与该节点开始进行同步,直到同步完成。
而在这个其中,我们可以从这里发现一条重要的信息:
2437:M 05 Oct 2022 00:13:36.168 # CONFIG REWRITE executed with success.
意思是,在这个节点上CONFIG REWRITE命令执行成功。
到这里,其实我们就可以发现Redis哨兵模式不切换的根本原因了,即哨兵在发现主节点宕机后,除了投票选出新的主redis节点。同时,也会连接到节点上去执行CONFIG REWRITE修改相关配置,这点我们也可以在redis.conf中看到:
# Generated by CONFIG REWRITE
pidfile "/var/run/redis.pid"
user default on #b9fca0020931209c57decc10102d72f80b6732977323c473586366f5f43c1d51 ~* &* +@all
可以看到在redis2的redis.conf的配置文件中多了两行配置信息,且出现了Generated by CONFIG REWRITE的提示。
而在最初的配置当中,我们则将CONFIG命令进行了rename-command为空的操作,所以当哨兵想要切换时,在进行config rewrite操作的时候,则找不到该命令,无法修改相关参数。最终导致切换卡住,直到超时,最终切换失败。