Oracle特殊恢复:异常掉电导致的ORA-600 [kfrValAcd30]故障处理

news/2024/9/24 13:20:20/

一、 问题描述

现象:硬件掉电后,oracle集群无法启动。

[root@rac2 ~]# crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@rac2 ~]# crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

二、 故障处理

查看集群组件发现ora.asm状态为offline

root@rac2 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm1        ONLINE  OFFLINE                               Instance Shutdown   
ora.cluster_interconnect.haip1        ONLINE  ONLINE       rac2                                         
ora.crf1        ONLINE  ONLINE       rac2                                         
ora.crsd1        ONLINE  OFFLINE                                                   
ora.cssd1        ONLINE  ONLINE       rac2                                         
ora.cssdmonitor1        ONLINE  ONLINE       rac2                                         
ora.ctssd1        ONLINE  ONLINE       rac2                     OBSERVER            
ora.diskmon1        OFFLINE OFFLINE                                                   
ora.drivers.acfs1        ONLINE  ONLINE       rac2                                         
ora.evmd1        ONLINE  INTERMEDIATE rac2                                         
ora.gipcd1        ONLINE  ONLINE       rac2                                         
ora.gpnpd1        ONLINE  ONLINE       rac2                                         
ora.mdnsd1        ONLINE  ONLINE       rac2                                         

查看grid alert日志发现磁盘组没有mount

[ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2018-05-08 04:12:24.940:
[cssd(4576)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2018-05-08 04:12:26.188:
[cssd(4576)]CRS-1605:CSSD voting file is online: /dev/asmdisk/oraasm-OCR_0000; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2018-05-08 04:12:28.723:
[cssd(4576)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2018-05-08 04:12:30.617:
[ctssd(4660)]CRS-2401:The Cluster Time Synchronization Service started on host rac2.
2018-05-08 04:12:30.617:
[ctssd(4660)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
2018-05-08 04:12:32.348:
[ohasd(4329)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2018-05-08 04:12:32.348:
[ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.

查看asm_alert,出现ORA-00600 [kfrValAcd30]的报错

NOTE: GMON heartbeating for grp 2
GMON querying group 2 at 6 for pid 23, osid 5727
NOTE: cache opening disk 0 of grp 2: DATA_0000 path:/dev/asmdisk/oraasm-ASM_0000
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DATA_0001 path:/dev/asmdisk/oraasm-ASM_0001
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache opening disk 2 of grp 2: DATA_0002 path:/dev/asmdisk/oraasm-ASM_0002
NOTE: F1X0 found on disk 2 au 2 fcn 0.0
NOTE: cache opening disk 3 of grp 2: DATA_0003 path:/dev/asmdisk/oraasm-ASM_0003
NOTE: cache mounting (first) normal redundancy group 2/0x877A96CD (DATA)
* allocate domain 2, invalid = TRUE
NOTE: attached to recovery domain 2
NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc  (incident=50111):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
Abort recovery for domain 2
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
NOTE: cache dismounting (clean) group 2/0x877A96CD (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)

查看ASM2_ora_5727_i50111.trc

kfdp_query(DATA): 6
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_query()+530<-kfdPstSyncPriv()+585<-kfgFinalizeMount()+1630<-kfgscFinalize()+1433<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558<-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842
<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
2018-05-17 21:54:44.321623 : Start recovery for domain=2, valid=0, flags=0x4
NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
Incident 50111 created, dump file: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"recovery of group DATA failed due to the following error(s):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
2018-05-17 21:54:45.016907 : Abort recovery for domain 2, flags = 0x4
2018-05-17 21:54:45.017799 : kjb_abort_recovery: domain flags=0x0, valid=0
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
kfdp_dismount(): 7
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdel()+155<-kfgscFinalize()+1211<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558
<-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
ASM name of disk 0x9f1d9488 (2:0:DATA_0000:/dev/asmdisk/oraasm-ASM_0000) is being cleared
ASM name of disk 0x9f1d9808 (2:1:DATA_0001:/dev/asmdisk/oraasm-ASM_0001) is being cleared
ASM name of disk 0x9f1d9108 (2:2:DATA_0002:/dev/asmdisk/oraasm-ASM_0002) is being cleared
ASM name of disk 0x9f1d8d88 (2:3:DATA_0003:/dev/asmdisk/oraasm-ASM_0003) is being cleared
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
kfdp_dismount(): 8
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdct()+258<-kfgsoDelete()+721<-kssdel()+155<-kssdch()+6849<-ksures()+52<-opiosq0()+4721<-kpooprx()+274<-kpoal8()+842
<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----

搜索mos也发现了几个Bug:

Bug 27288230 - ORA-600: [kfrvalacd30] in ASM SMON (Doc ID 27288230.8)	
Bug 19064132 : ORA-600 [KFRVALACD30] MOUNTING DISKGROUP AFTER RESTARTING SERVERS
ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)
bug在11.2.0.1/11.2.0.3/112.0.4/12.1/18.1 都存在。

Oracle认为这是存储或者OS问题导致asm acd block的元数据不一致了
可能导致ASM元数据的主辅扩展区都发生损坏。这个损坏会导致rebalance挂起或不断尝试失败,或者阻止磁盘组被挂载。

strace出现ORA 600 [kfrValAcd30]报错,但没发现其他可用的信息

write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "ORA 600 [kfrValAcd30]", 21)  = 21
write(25, "</", 2)                      = 2
write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "</", 2)                      = 2
write(25, "pkey_dbgristih", 14)         = 14
write(25, ">", 1)                       = 1
write(25, "\n", 1)                      = 1
write(25, "<", 1)                       = 1
write(25, "creatm_dbgristih", 16)       = 16
write(25, ">", 1)                       = 1
write(25, "<", 1)                       = 1
write(25, "time", 4)                    = 4
write(25, ">", 1)                       = 1

ASM的Active Change Directory(ACD)简单来说就相当于asm元数据重做记录,需要说明的是,每一个asm实例,都用有其自己的ACD目录,也就说,如果你是双节点的rac,那么就有84m的ACD 目录信息。
ACD主要是记录了这样一些数据:线程号,block号,opcode,sequences,记录长度等信息。
当处于运行的asm实例,突然crash后,那么重启asm实例以后,asm可以根据ACD信息去进行”instance recover”,从而保证能够正常启动asm实例。
如果acd的信息出现损坏,磁盘组将不能够被mount。

使用kfed来读取元数据,首先定位到active change directory 所在AU

[grid@rac2 ~]$  kfed read  /dev/dm-0  aun=2 blkn=3|grep .xptr.au |head -1
kfffde[0].xptr.au:                    4 ; 0x4a0: 0x00000004

从上面信息可以看出,ACD元数据和数据应该包含在其中一个AU 4

查看au 4的信息

grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=1|more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            8 ; 0x002: KFBTYP_CHNGDIR
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       1 ; 0x004: blk=1
kfbh.block.obj:                       3 ; 0x008: file=3
kfbh.check:                    17400326 ; 0x00c: 0x01098206
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfracdb.aba.seq:                      2 ; 0x000: 0x00000002
kfracdb.aba.blk:                      0 ; 0x004: 0x00000000
kfracdb.ents:                         2 ; 0x008: 0x0002
kfracdb.ub2spare:                     0 ; 0x00a: 0x0000
kfracdb.lge[0].valid:                 1 ; 0x00c: V=1 B=0 M=0
kfracdb.lge[0].chgCount:              1 ; 0x00d: 0x01
kfracdb.lge[0].len:                  52 ; 0x00e: 0x0034
kfracdb.lge[0].kfcn.base:             1 ; 0x010: 0x00000001
kfracdb.lge[0].kfcn.wrap:             0 ; 0x014: 0x00000000
kfracdb.lge[0].bcd[0].kfbl.blk:       0 ; 0x018: blk=0
kfracdb.lge[0].bcd[0].kfbl.obj:       4 ; 0x01c: file=4
kfracdb.lge[0].bcd[0].kfcn.base:      0 ; 0x020: 0x00000000
kfracdb.lge[0].bcd[0].kfcn.wrap:      0 ; 0x024: 0x00000000
kfracdb.lge[0].bcd[0].oplen:          4 ; 0x028: 0x0004  ---表示长度,类似logfile dump的LEN
kfracdb.lge[0].bcd[0].blkIndex:       0 ; 0x02a: 0x0000
kfracdb.lge[0].bcd[0].flags:         28 ; 0x02c: F=0 N=0 F=1 L=1 V=1 A=0 C=0
kfracdb.lge[0].bcd[0].opcode:       212 ; 0x02e: 0x00d4  --opcode,类似数据库实例中的update/delete/insert操作的opcode编号
kfracdb.lge[0].bcd[0].kfbtyp:         9 ; 0x030: KFBTYP_COD_BGO  --操作类型,类似数据库实例中的update/delete/insert等类型
kfracdb.lge[0].bcd[0].redund:        17 ; 0x031: SCHE=0x1 NUMB=0x1  --这里表示冗余级别,17是unport,18是mirror,19表示high
kfracdb.lge[0].bcd[0].pad:        63903 ; 0x032: 0xf99f
kfracdb.lge[0].bcd[0].KFRCOD_CRASH:   1 ; 0x034: 0x00000001
kfracdb.lge[0].bcd[0].au[0]:         46 ; 0x038: 0x0000002e
kfracdb.lge[0].bcd[0].disks[0]:       0 ; 0x03c: 0x0000
kfracdb.lge[1].valid:                 1 ; 0x040: V=1 B=0 M=0
kfracdb.lge[1].chgCount:              1 ; 0x041: 0x01

check等信息属于hash值,每隔3s都会更新一次,可能是由于突然掉电,cache里的信息没有更新到磁盘中导致。

[grid@rac2 ~]$  kfed read /dev/dm-2 aun=4 blkn=0|grep ckpt
kfracdc.ckpt.seq:                     8 ; 0x018: 0x00000008
kfracdc.ckpt.blk:                   390 ; 0x01c: 0x00000186
[grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=0|grep ckpt
kfracdc.ckpt.seq:                    70 ; 0x018: 0x00000046
kfracdc.ckpt.blk:                   255 ; 0x01c: 0x000000f

三、 问题解决

通过kfed merge手工修改的方法恢复

kfed read /dev/dm-2 aun=4 blkn=0 >acd.txt
kfed merge /dev/dm-2 aun=4 blkn=0  text=acd.txt

手工mount磁盘组

SQL>  alter diskgroup data mount;alter diskgroup data mount
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8],
[390], [9], [390], [], [], [], [], []SQL> alter diskgroup data mount;
Diskgroup altered.

如果kfed merge无法修复,请及时联系,以通过odu对asm底层进行恢复数据文件,从而最大限度恢复数据.

作者介绍:云和恩墨资深Oracle dba,专注于数据库运维、架构和行业发展,有12年左右的金融、保险、政府、地税、运营商等业务关键型系统的运维经验,曾担任公司异常恢复东区接口人,负责紧急异常恢复工作,技术二线专家。目前负责PG、openGauss/MogDB运维、国产化MogDB数据库的推广工作。


http://www.ppmy.cn/news/1433786.html

相关文章

Linux下如何使用Git进行版本控制

Linux下使用Git进行版本控制 介绍&#xff1a; Git是一种分布式版本控制系统&#xff0c;最初由Linus Torvalds在2005年开发并在Linux内核项目中使用。它具有高效、快速、简单、灵活和强大的特性&#xff0c;因此在开源社区和软件开发行业的广泛使用。 Git的工作原理&#x…

UniApp 中的路由守卫与拦截器:守护应用的每一步

正文&#xff1a; 路由守卫和拦截器在前端开发中扮演着重要的角色&#xff0c;它们可以用来控制页面访问权限、全局请求拦截等。在 UniApp 中&#xff0c;路由守卫和拦截器同样具有强大的功能&#xff0c;能够保护应用的安全和稳定性。本文将深入探讨 UniApp 中的路由守卫和拦…

Harmony OS应用开发性能优化全面指南

优化应用性能对于应用开发至关重要。通过高性能编程、减少丢帧卡顿、提升应用启动和响应速度&#xff0c;可以有效提升用户体验。本文将介绍一些优化应用性能的方法&#xff0c;以及常用的性能调优工具。 ArkTS高性能编程 为了提升代码执行速度&#xff0c;进而提升应用整体性…

Docker容器搭建Hadoop集群(hadoop-3.1.3)

Docker容器环境下搭建Hadoop集群&#xff08;完全分布式&#xff09; hadoop版本为hadoop-3.1.3 &#xff08;1&#xff09;安装额外的速度较快的镜像库 yum install -y epel-release &#xff08;2&#xff09;安装同步工具&#xff0c;方便在多台服务器上进行文件的传输 …

工业自动化,3D视觉技术3C薄片自动化上料

随着制造业的快速发展&#xff0c;3C行业对薄片类零件的上料需求日益增长。传统的上料方式往往依赖于人工操作&#xff0c;效率低下且存在误差。为了解决这一问题&#xff0c;3D视觉技术应运而生&#xff0c;为3C薄片自动化上料提供了强大的技术支持。本文将探讨3D视觉技术如何…

解锁ApplicationContext vs BeanFactory: 谁更具选择性?

目录 一、聚焦源码回顾 &#xff08;一&#xff09;源码分析和理解 &#xff08;二&#xff09;简短的回顾对比建议 二、ApplicationContext vs BeanFactory特性对比 &#xff08;一&#xff09;主要特性总结 &#xff08;二&#xff09;直接建议 三、案例简单说明 &am…

演示在一台Windows主机上运行两个Mysql服务器(端口号3306 和 3307),安装步骤详解

目录 在一台Windows主机上运行两个Mysql服务器&#xff0c;安装步骤详解因为演示需要两个 MySQL 服务器终端&#xff0c;我只有一个 3306 端口号的 MySQL 服务器&#xff0c;所以需要再创建一个 3307 的。创建一个3307端口号的MySQL服务器1、复制 mysql 的安装目录2、修改my.in…

idea运行报错:启动命令过长

JAVA项目&#xff0c;运行的时候报错 Command line is too long. Shorten the command line via JAR manifest or via a classpath file and rerun老问题了&#xff0c;记录一下 解决办法&#xff1a; 1、Edit Configurations 2、点击Modify options设置&#xff0c;勾选S…