最近碰到了一个关于ASM无法启动的案例,当然这个案例比较长,准备分两篇来写。
问题的背景如下:
目前存在一套standalone的环境,采用了ASM作为存储管理,业务属于实时统计,在某一天下班的时候开发的同事突然联系我说,数据库感觉有些问题,因为部分应用开始报错了,然后他们问我在这段时间做过什么操作没有,从我的印象来看下午4点只对部分分区做了例行维护,其它什么都没调整。问题是在6点开始发现,所以应该没有什么关联。
当然口说无凭,开发的同学也反馈了报错的信息。
反馈错误信息1:Yourdatabase connection has timed out, either due to network problems or due toOracle reaching the IDLE_TIME profile limit.
反馈错误信息2:Requestprocessing failed; nested exception is java.sql.SQLException: No more data toread from socket
所以马上查看数据库连接数,发现只有80多个,排查了profile的情况,没有发现什么问题。
这个时候脑海中开始闪现各种可能性,目前发现只有两种可能,一种是例行维护的操作。另外一种是另外一个毫不相关的机房网络调整。对于例行维护,可以基本断定出现问题的概率较小,因为是truncate了部分的分区,而且分区数也不多,应该没有什么问题,除非出现了意料之外的bug。而目前的情况是更多倾向于网络层面,因为看开发提供的报错信息,似乎出现了网络的问题,所以也协调IT的同学帮忙看看是不是机房调整不小心影响到了另外一个机房的服务器。
但是反馈的情况还是没有关联,所以这个网络的问题似乎是排除了,当然这个确认也花了些时间。而且这个时候我已经在地铁上了,联系同事,他反馈数据库里面有一个ORA的错误是一个297开头的错误,这个听起来还是有些陌生,而且同事反馈似乎数据库是有种hang的感觉,这个时候尝试shutdown immediate都会卡住,没有反应。这个时候感觉应该不是简单的网络问题了,最后为了业务的使用,同事帮忙先做了failover的操作,主库的这个问题先保留现场,等我回去马上查看。
发现这个时候ASM实例是无法启动的了。
$sqlplus / as sysasm
SQL*Plus:Release 11.2.0.3.0 Production on Sat Jan 21 20:34:31 2016
Copyright(c) 1982, 2011, Oracle. All rightsreserved.
Connectedto an idle instance.
SQL>startup
ORA-01078:failure in processing system parameters
ORA-29701:unable to connect to Cluster Synchronization Service
数据库中的错误日志类似下面的形式:
unrecoverableerror ORA-29701 raised in ASM I/O path; terminating process 7303
ThuJan 21 06:58:27 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 7593
ThuJan 21 07:00:22 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 7644
ThuJan 21 07:00:51 2016
ERROR:unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 7659
对于这个错误,检查了CSS和HAS的服务情况,发现都是Online状态。
$crsctl check css
CRS-4529:Cluster Synchronization Services is online
$crsctl check has
CRS-4638:Oracle High Availability Services is online
这个时候的问题是这台服务器也没有重启过,出现了这个奇怪的问题。而且上一次重启ASM是自动启动的。
查看服务的情况,发现db的服务显示是UNKNOWN
$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.DATA01.dg ora....up.type ONLINE ONLINE rolequery
ora.FLASH01.dgora....up.type ONLINE ONLINE rolequery
ora....ER.lsnrora....er.type ONLINE ONLINE rolequery
ora.asm ora.asm.type ONLINE ONLINE rolequery
ora.cssd ora.cssd.type ONLINE ONLINE rolequery
ora.testbi.db ora....se.type ONLINE UNKNOWN rolequery
ora.diskmon ora....on.type OFFLINE OFFLINE
ora.evmd ora.evm.type ONLINE ONLINE rolequery
ora.ons ora.ons.type OFFLINE OFFLINE
ora....rydb.dbora....se.type ONLINE ONLINE rolequery
这个时候感觉是CSS的部分出了问题,但是还没有定位到,使用crs_stat -p查看更明细的信息。
$crs_stat -p ora.cssd
NAME=ora.cssd
TYPE=ora.cssd.type
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=30
DESCRIPTION="Resourcetype for CSSD"
FAILOVER_DELAY=0
FAILURE_INTERVAL=3
FAILURE_THRESHOLD=5
HOSTING_MEMBERS=
PLACEMENT=balanced
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=600
STOP_TIMEOUT=900
UPTIME_THRESHOLD=1m
当然尝试重启CSS的服务,停服务是成功了。
$crsctl start resource ora.cssd
CRS-5702:Resource 'ora.cssd' is already running on 'rolequery'
CRS-4000:Command Start failed, or completed with errors.
$crs_stop ora.testbi.db
Attemptingto stop `ora.testbi.db` on member `rolequery`
Stopof `ora.testbi.db` on member `rolequery` succeeded.
但是重启报错了,尝试通过crs_start启动数据库实例的服务还是报错。
$crs_start ora.testbi.db
Attemptingto start `ora.asm` on member `rolequery`
CRS-5017:The resource action "ora.asm start" encountered the following error:
ORA-01078:failure in processing system parameters
ORA-29701:unable to connect to Cluster Synchronization Service
.For details refer to "(:CLSN00107:)" in"/home/U01/app/grid/11.2.3/log/testbi/agent/ohasd/oraagent_grid/oraagent_grid.log".
Startof `ora.asm` on member `rolequery` failed.
Attemptingto stop `ora.asm` on member `rolequery`
Stopof `ora.asm` on member `rolequery` succeeded.
CRS-0215:Could not start resource 'ora.testbi.db'.
查看GI中的日志如下:
[ clsdmc][1431308032]Fail to connect(ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD)) with status 9
2016-01-2200:11:22.811: [ora.evmd][1431308032] {0:10:5} [start] Error = error 9encountered when connecting to EVMD
2016-01-2200:11:23.811: [ora.evmd][1431308032] {0:10:5} [start] without returnbuf
2016-01-2200:11:23.811: [ COMMCRS][1433937664]clsc_connect: (0x7f251403a630) no listenerat (ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD))
[ clsdmc][1431308032]Fail to connect(ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD)) with status 9
2016-01-2200:11:23.812: [ora.evmd][1431308032] {0:10:5} [start] Error = error 9encountered when connecting to EVMD
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check] DgpAgent::getConnxncheck 9
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check] ConnectionPool::getConnection160 ERROR sid +ASM pool status 1
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check]ConnectionPool::getConnection 260 pConnxn 00000000
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check] DgpAgent::getConnxnconnection failure 2
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check] DgpAgent::getConnxnfailed DgpAgent::getConnxn unable to connect
2016-01-2200:11:24.741: [ora.DATA01.dg][1450747648] {0:10:3} [check] DgpAgent::getConnxnwait then retry
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check] DgpAgent::getConnxncheck 9
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check]ConnectionPool::getConnection 160 ERROR sid +ASM pool status 1
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check]ConnectionPool::getConnection 260 pConnxn 00000000
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check] DgpAgent::getConnxnconnection failure 2
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check] DgpAgent::getConnxnfailed DgpAgent::getConnxn unable to connect
2016-01-2200:11:24.751: [ora.FLASH01.dg][1436038912] {0:10:3} [check] DgpAgent::getConnxnwait then retry
2016-01-2200:11:24.812: [ora.evmd][1431308032] {0:10:5} [start] without returnbuf
2016-01-2200:11:24.812: [ COMMCRS][1433937664]clsc_connect: (0x7f251403a630) no listenerat (ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD))
[ clsdmc][1431308032]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD))with status 9
2016-01-2200:11:24.812: [ora.evmd][1431308032] {0:10:5} [start] Error = error 9encountered when connecting to EVMD
2016-01-2200:11:25.813: [ora.evmd][1431308032] {0:10:5} [start] without returnbuf
2016-01-2200:11:25.813: [ COMMCRS][1433937664]clsc_connect: (0x7f251403a630) no listenerat (ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD))
[ clsdmc][1431308032]Fail to connect(ADDRESS=(PROTOCOL=ipc)(KEY=testbiDBG_EVMD)) with status 9
2016-01-2200:11:25.813: [ora.evmd][1431308032] {0:10:5} [start] Error = error 9encountered when connecting to EVMD
这个时候cssd的状态还是ONLINE,db的状态是offline.
$crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.DATA01.dg ora....up.type ONLINE ONLINE rolequery
ora.FLASH01.dgora....up.type ONLINE ONLINE rolequery
ora....ER.lsnrora....er.type ONLINE ONLINE rolequery
ora.asm ora.asm.type ONLINE OFFLINE
ora.cssd ora.cssd.type ONLINE ONLINE rolequery
ora.testbi.db ora....se.type ONLINE OFFLINE
ora.testbi1.dbora....se.type OFFLINE OFFLINE
ora.diskmon ora....on.type OFFLINE OFFLINE
ora.evmd ora.evm.type ONLINE OFFLINE
ora.ons ora.ons.type OFFLINE OFFLINE
ora....rydb.dbora....se.type ONLINE OFFLINE
查看所有的资源状态:
$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
LocalResources
--------------------------------------------------------------------------------
ora.DATA01.dg
ONLINE INTERMEDIATE rolequery CHECK TIMED OUT
ora.FLASH01.dg
ONLINE INTERMEDIATE rolequery CHECK TIMED OUT
ora.LISTENER.lsnr
ONLINE ONLINE rolequery
ora.asm
ONLINE OFFLINE rolequery InstanceShutdown
ora.ons
OFFLINE OFFLINE rolequery
--------------------------------------------------------------------------------
ClusterResources
--------------------------------------------------------------------------------
ora.cssd
1 ONLINE ONLINE rolequery
ora.testbi.db
1 ONLINE OFFLINE InstanceShutdown
ora.testbi1.db
1 OFFLINE OFFLINE InstanceShutdown
ora.diskmon
1 OFFLINE OFFLINE
ora.evmd
1 ONLINE OFFLINE
ora.rolequerydb.db
ONLINE OFFLINE
这个时候感觉问题比较奇怪,常规的配置都没有发现问题,但是CSS感觉还是出现了问题。
再次检查。
$crsctl check has
CRS-4638:Oracle High Availability Services is online
$crsctl check evm
CRS-4534:Cannot communicate with Event Manager
$crsctl check css
CRS-4529:Cluster Synchronization Services is online
最后使用kfod查看一下asm配置层面是否有问题。
$kfod asm_diskstring='/dev/asm-sda9' disk=all
--------------------------------------------------------------------------------
Disk Size Path User Group
================================================================================
1: 153605 Mb /dev/asm-sda9 grid asmadmin
KFOD-00301:Unable to contact Cluster Synchronization Services (CSS). Return code 2 fromkgxgncin.
所以通过这些我更加坚信的问题就是CSS中的问题,所以后续需要花一些功夫从以这个为出发点来分析这个问题。