视频1 视频21 视频41 视频61 视频文章1 视频文章21 视频文章41 视频文章61 推荐1 推荐3 推荐5 推荐7 推荐9 推荐11 推荐13 推荐15 推荐17 推荐19 推荐21 推荐23 推荐25 推荐27 推荐29 推荐31 推荐33 推荐35 推荐37 推荐39 推荐41 推荐43 推荐45 推荐47 推荐49 关键词1 关键词101 关键词201 关键词301 关键词401 关键词501 关键词601 关键词701 关键词801 关键词901 关键词1001 关键词1101 关键词1201 关键词1301 关键词1401 关键词1501 关键词1601 关键词1701 关键词1801 关键词1901 视频扩展1 视频扩展6 视频扩展11 视频扩展16 文章1 文章201 文章401 文章601 文章801 文章1001 资讯1 资讯501 资讯1001 资讯1501 标签1 标签501 标签1001 关键词1 关键词501 关键词1001 关键词1501 专题2001
打完AIX操作系统补丁,ASMhang磁盘无法mount
2020-11-09 13:05:58 责编:小采
文档


同事的一套数据库是AIX系统的,打完操作系统补丁之后,ASM实例启动无法mount磁盘组,出现hang的情况。然后我们对这个hang的过程做了一个hang analyze,hang分析的文件如下所示: Chains most likely to have caused the hang: [a] Chain 1 Signature: ='rdbm

同事的一套数据库是AIX系统的,打完操作系统补丁之后,ASM实例启动无法mount磁盘组,出现hang的情况。然后我们对这个hang的过程做了一个hang analyze,hang分析的文件如下所示:

Chains most likely to have caused the hang:
 [a] Chain 1 Signature: <='rdbms ipc reply'<='enq: DD - contention'
 Chain 1 Signature Hash: 0x7bd12357
 [b] Chain 2 Signature: <='rdbms ipc reply'<='enq: DD - contention'
 Chain 2 Signature Hash: 0x7bd12357
===============================================================================
Non-intersecting chains:
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
 Oracle session identified by:
 {
 instance: 1 (+asm.+asm)
 os id: 58472
 process id: 20, oracle@NG-CSDB (TNS V1-V3)
 session id: 1261
 session serial #: 1
 }
 is waiting for 'enq: DD - contention' with wait info:
 {
 p1: 'name|mode'=0x44440006
 p2: 'disk group'=0x0
 p3: 'type'=0x1
 time in wait: 2 min 13 sec
 timeout after: never
 wait id: 10
 blocking: 0 sessions
 current sql: ALTER DISKGROUP WPSDG MOUNT /* asm agent *//* {0:0:324} */
 short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-sskgpwwait()+32<-skgpwwait()+180<-ksliwat()+11032<-kslwaitctx()+180<-ksqcmi()+5604<-ksqgtlctx()+2904<-ksqgelctx()+660<-kfgUseDmt()+424<-kfxdrvMount()+1092<-kfxdrvEntry()+2296<-opiexe()+18568<-opiosq0()+28<-kpooprx()+316<-kpoal8()+692<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940<-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268<-main()+204<-__start()+112
 wait history:
 * time between current wait and wait #1: 0.000979 sec
 1. event: 'SQL*Net message from client'
 time waited: 0.000099 sec
 wait id: 9 p1: 'driver id'=0x62657100
 p2: '#bytes'=0x1
 * time between wait #1 and #2: 0.000002 sec
 2. event: 'SQL*Net message to client'
 time waited: 0.000000 sec
 wait id: 8 p1: 'driver id'=0x62657100
 p2: '#bytes'=0x1
 * time between wait #2 and #3: 0.000021 sec
 3. event: 'SQL*Net message from client'
 time waited: 0.000815 sec
 wait id: 7 p1: 'driver id'=0x62657100
 p2: '#bytes'=0x1
 }
 and is blocked by
 => Oracle session identified by:
 {
 instance: 1 (+asm.+asm)
 os id: 5112446
 process id: 18, oracle@NG-CSDB (TNS V1-V3)
 session id: 1135
 session serial #: 1
 }
 which is waiting for 'rdbms ipc reply' with wait info:
 {
 p1: 'from_process'=0xd
 p2: 'timeout'=0x7fffff7b
 time in wait: 1.626263 sec
 timeout after: 0.373737 sec
 wait id: 86
 blocking: 2 sessions
 current sql: ALTER DISKGROUP UATDG MOUNT /* asm agent *//* {0:0:324} */
 short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-sskgpwwait()+32<-skgpwwait()+180<-ksliwat()+11032<-kslwaitctx()+180<-kslwait()+112<-ksarcr()+348<-ksbwcoex()+136<-kfgbSendWithPin()+5<-kfgbSendShallow()+244<-kfgDiscoverShallow()+392<-kfgGlobalOpen()+388<-kfgDiscoverDeep()+332<-kfgDiscoverGroup()+1160<-kfgFinalizeMount()+1236<-kfgscFinalize()+1552<-kfgForEachKfgsc()+272<-kfgsoFinalize()+200<-kfgFinalize()+500<-kfxdrvMount()+3436<-kfxdrvEntry()+2296<-opiexe()+18568<-opiosq0()+28<-kpooprx()+316<-k
 wait history:
 * time between current wait and wait #1: 0.000040 sec
 1. event: 'rdbms ipc reply'
 time waited: 2.000011 sec
 wait id: 85 p1: 'from_process'=0xd
 p2: 'timeout'=0x7fffff7d
 * time between wait #1 and #2: 0.000045 sec
 2. event: 'rdbms ipc reply'
 time waited: 2.000012 sec
 wait id: 84 p1: 'from_process'=0xd
 p2: 'timeout'=0x7fffff7f
 * time between wait #2 and #3: 0.000038 sec
 3. event: 'rdbms ipc reply'
 time waited: 2.000010 sec
 wait id: 83 p1: 'from_process'=0xd
 p2: 'timeout'=0x7fffff81
 }
 and is blocked by
 => Oracle session identified by:
 {
 instance: 1 (+asm.+asm)
 os id: 4653624
 process id: 13, oracle@NG-CSDB (RBAL)
 session id: 820
 session serial #: 1
 }
 which is not in a wait:
 {
 last wait: 2 min 13 sec ago
 blocking: 3 sessions
 current sql: 
 short stack: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-48b8<-skgfrdevstat()+444<-skgfrdscdevs()+1100<-ksfddscdevs()+24<-kfk_dscv_ufs_init()+208<-kfk_discover_disks()+5<-kfdDiscoverString()+100<-kfdDiscoverShallow()+17<-kfgbDriver()+1908<-ksbabs()+2800<-ksbrdp()+2044<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112
 wait history:
 1. event: 'CSS operation: action'
 time waited: 0.0030 sec
 wait id: 3 p1: 'function_id'=0x43
 * time between wait #1 and #2: 0.000428 sec
 2. event: 'GPnP Get Item'
 time waited: 0.091329 sec
 wait id: 2 
 * time between wait #2 and #3: 0.003109 sec
 3. event: 'CSS initialization'
 time waited: 0.046782 sec
 wait id: 1 
 }

可以看到我们的会话在执行磁盘组mount的过程中,出现了等待事件enq: DD – contention,而这个进程被ASM的RBAL进程所阻塞,RBAL的历史等待情况是’CSS initialization’=>’GPnP Get Item’=>’CSS operation: action’,而hang分析文件里面也包含了RBAL进程的堆栈信息。一般遇到这种问题,我们需要把相关堆栈拿到support上做一下搜索,于是,我搜到了一篇文档,和我们等待事件还有堆栈信息如出一辙。文档内容如下:ASM Instance Hangs During The Diskgroup Mount Stage After AIX OS Patch Install (文档 ID 1633273.1)。我们来看一下该文档的一些说明情况。文档上说明这是bug 16288155导致的,但是在11.2.0.3上已经进行了修复。但是奇怪的是,我们的版本就是11.2.0.3的版本。文档上要我们对RBAL进程做一下truss信息搜集,如下所示:

1. Create a script called rbal.sh on /home/oracle directory. Add the following lines:
#!/bin/csh
/usr/bin/truss -o /tmp/rbal_strace.log -aefldD /bin/oracle
chmod +x /home/oracle/rbal.sh
2.
sqlplus / as sysasm
SQL>create pfile='/tmp/xx.ora' from spfile;
3. Edit xx.ora and add the following parameter
_dbg_proc_startup=TRUE
4. Shutdown ASM
5. Startup ASM using the following commands:
# csh
# setenv RBAL /home/oracle/rbal.sh
# sqlplus / as sysasm
# SQL> startup pfile='/tmp/xx.ora'

我们按照文档的要求对进程做了truss的收集,内容如下:

3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF7FD0, 176, 01) = 0 
3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF8100, 176, 0) = 0 
3866976: 20447683: 0.0004: kopen("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP) = 7 
3866976: 20447683: 0.0002: kfcntl(7, F_SETFD, 0x0000000000000001) = 0 
3866976: 20447683: kwrite(7, " < m s g t i m e = ' 2".., 283) = 283 
3866976: 20447683: 0.0002: close(7) = 0 
3866976: 20447683: 0.0002: statx("/oracle/app/grid_base/diag/asm/+asm/+ASM/alert/log.xml", 0x0FFFFFFFFFFF7620, 176, 0) = 0 
3866976: 20447683: 0.0002: kopen("/oracle/app/grid_base/diag/asm/+asm/+ASM/trace/alert_+ASM.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP) = 7 
3866976: 20447683: 0.0002: kfcntl(7, F_SETFD, 0x0000000000000001) = 0 
3866976: 20447683: kwrite(7, " W e d J u l 2 3 0".., 25) = 25 
3866976: 20447683: kwrite(7, " R B A L s t a r t e d".., 41) = 41 <<<<<<<<<< 
3866976: 20447683: 0.0002: close(7) = 0 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...) 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1 
3866976: 20447683: 1.0002: kopen("/proc/4653566/psinfo", O_RDONLY|O_LARGEFILE) = 7 
3866976: 20447683: kread(7, "\0 \0010204\001\0\0\001".., 448) = 448 
3866976: 20447683: 0.0002: close(7) = 0 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...) 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...) 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) = 1 
3866976: 20447683: 2.0001: _thread_wait(1152921504606830016) (sleeping...) 

我们可以看到,RBAL已经启动了,然后就一直处于Sleeping的状态。证明RBAL进程没有什么问题。而MOS的文档是显示RBAL访问磁盘出现了问题。我们的情况是不一样的。那么究竟是什么问题?这个确实很诡异,我们只能根据这篇文章最后介绍的办法就是把os的patch回滚掉,然后重新relink软件。对于这个问题开了SR也没有什么特别的进展,怀疑是一个还没有发现的冲突。因为毕竟Oracle软件是先研发出来的,而IBM的新补丁是后开发出来的,这些本身就容易造成一些不兼容的情况。最好是把操作系统补丁都安装好了,然后再重装最新版的数据库软件。

参考文档:ASM Instance Hangs During The Diskgroup Mount Stage After AIX OS Patch Install (文档 ID 1633273.1)

下载本文
显示全文
专题