The instance on node 2 of a customer's RAC database crashed spontaneously. The analysis report follows.
1. Symptom review:
When the fault occurred on node 2, the alert log showed the following:
Thread 2 advanced to log sequence 77740 (LGWR switch)
Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2G
Current log# 24 seq# 77740 mem# 1: /dev/rcrm4_rd2_92_2G
Mon Mar 28 09:45:23 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:46:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:53 BEIST 2011
Thread 2 advanced to log sequence 77741 (LGWR switch)
Current log# 25 seq# 77741 mem# 0: /dev/rcrm5_rd2_101_2G
Current log# 25 seq# 77741 mem# 1: /dev/rcrm5_rd2_102_2G
Mon Mar 28 09:48:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:49:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:50:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:51:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:52:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:53:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:54:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:55:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:56:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:57:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:58:44 BEIST 2011
Received an instance abort message from instance 1 (reason 0x0)
Please check instance 1 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481
Mon Mar 28 09:58:45 BEIST 2011
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/crmdb/bdump/crmdb2_diag_3732132.trc
Mon Mar 28 09:58:45 BEIST 2011
Shutting down instance (abort)
License high water mark = 2290
From the alert log, node 2's main symptom was that the PMON process could not acquire a latch (PMON failed to acquire latch). After timing out repeatedly for over ten minutes, it received the abort instruction from node 1 (Received an instance abort message from instance 1 (reason 0x0)), and the LMD0 process aborted the instance (LMD0: terminating instance due to error 481). A systemstate dump was taken automatically before the crash (System state dump is made for local instance), and the user session high-water mark on node 2 before the crash was 2290 (License high water mark = 2290).
Analyzing crmdb2_diag_3732132.trc with the ass.awk tool yields the following:
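The hang window described above can be bracketed quickly from the alert log itself. A minimal sketch (the heredoc stands in for the real alert log, whose path is site-specific):

```shell
# Stand-in slice of the alert log (replace with the real file path).
cat > alert_sample.log <<'EOF'
Mon Mar 28 09:45:23 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:57:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:58:44 BEIST 2011
Received an instance abort message from instance 1 (reason 0x0)
EOF

# How many times did PMON report the latch failure?
grep -c 'PMON failed to acquire latch' alert_sample.log

# The timestamp line precedes each message; first and last bracket the hang window.
grep -B1 'PMON failed to acquire latch' alert_sample.log | grep 'BEIST' | sed -n '1p;$p'
```

On the real log this gives the failure count and the start/end timestamps of the PMON hang in one glance.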
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.9 - Processing crmdb2_diag_3732132.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
3: last wait for 'DIAG idle wait'
4: waiting for 'rdbms ipc message'
5: last wait for 'ges lmd/lmses to freeze in rcfg - mrcvr'
6: last wait for 'ges remote message'
7: waiting for 'gc object scan'
8: last wait for 'gc object scan'
9: last wait for 'gc object scan'
10: last wait for 'gc object scan'
11: waiting for 'gc object scan'
12: last wait for 'gc object scan'
13: last wait for 'gc object scan'
14: last wait for 'gc object scan'
15: waiting for 'latch: object queue header operation'[Latch 700000d1fc80b20]
16: waiting for 'latch: cache buffers chains'[Latch 700000cdf26c988]
17: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
18: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
19: last wait for 'gc object scan'
20: last wait for 'gc object scan'
21: last wait for 'gc object scan'
22: waiting for 'latch: cache buffers chains'[Latch 700000cea8a6a80]
23: waiting for 'rdbms ipc message'
24: waiting for 'gcs drm freeze in enter server mode'
25: waiting for 'gcs drm freeze in enter server mode'
26: waiting for 'gcs drm freeze in enter server mode'
27: waiting for 'gcs drm freeze in enter server mode'
28: waiting for 'gcs drm freeze in enter server mode'
29: waiting for 'gcs drm freeze in enter server mode'
30: waiting for 'gcs drm freeze in enter server mode'
31: waiting for 'gcs drm freeze in enter server mode'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: last wait for 'smon timer'
35: waiting for 'rdbms ipc message'
36: waiting for 'rdbms ipc message'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39:
40:
41: waiting for 'rdbms ipc message'
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 700000d1fe74fd0 ??? Blocker
Latch 700000d1fc80b20 ??? Blocker
Latch 700000cdf26c988 ??? Blocker
Latch 700000cea8a6a80 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 700000d1fe74fd0 Child object queue header operation
Latch 700000d1fc80b20 holding (efd=4) 700000d1fc80b20 Child obj
Latch 700000cdf26c988 Child cache buffers chains
Latch 700000cea8a6a80 Child cache buffers chains
762178 Lines Processed.
Further analysis of crmdb2_diag_3732132.trc (see the appendix for details) shows the following relationships between processes:
(1) At the time of the fault, orapid=7,8,10,11,12,13,14,19,21 were holding latch: object queue header operation at addr=700000d1fc80b20, leaving orapid=15 waiting; the sessions were all dead.
(2) At the time of the fault, orapid=9,20 were holding latch: object queue header operation at addr=700000d1fe74fd0, leaving orapid=2,17,18 waiting; process 2 was alive, while processes 17 and 18 were dead.
(3) At the time of the fault, orapid=16 was waiting on latch: cache buffers chains at addr=700000cdf26c988; the process was dead.
(4) At the time of the fault, orapid=22 was waiting on latch: cache buffers chains at addr=700000cea8a6a80; the process was dead.
(5) Notably, the DBWR processes were all dead at the time of the fault, with the following wait events:
Dumping Session Wait History
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488295 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488293 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488291 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
At the time of the fault, the various log files showed the following:
(1) Node 1 log information:
crmdb1_lmon_4579776.trc shows that when the LMON process on node 1 detected an anomaly in the LMD process (a 900-second timeout), it issued a request to node 2 (requesting memberkill of instances w/o ftdones); on receipt, node 2's LMD carried out the instance kill.
*** 2011-03-28 09:58:43.010
* kjfcdrmrcfg: waited 901 secs (6501391,6502292,900) for lmd to receive all ftdones, requesting memberkill of instances w/o ftdones:
kjfsprn: sync status inst 0 tmout 900 (sec)
kjfsprn: sync propose inc 4 level 691378
kjfsprn: dmap ver 4 (step 0)
kjfsprn: sync inc 4 level 691378
kjfsprn: sync bitmap 0 1
DEFER MSG QUEUE ON LMD0 IS EMPTY
SEQUENCES:
0:0.0 1:210167409.0
alert_crmdb1.log shows:
* kjfc_kill_no_sync: send abort msg to node 1 (noftdn)
Mon Mar 28 09:58:45 BEIST 2011
(2) Node 2 information:
crmdb2_lmd0_2154764.trc shows that node 2's LMD process received the instruction and prepared to bring the node down; 5 seconds beforehand it had the DIAG process take a systemstate dump.
*** 2011-03-28 09:58:44.845
Received a special admin message (type 2) from node 0
Message body: flag 0x5 data 0x0 0x0 0x0
Abort the instance
ksuitm: waiting up to [5] seconds before killing DIAG(3732132)
2. Fault analysis
The LMS (Global Cache Service) process manages access to data blocks within the cluster and ships block images between the buffer caches of the instances. LMS handles block requests across the cluster and guarantees that the current image of any given block appears only once across the buffer caches of all instances. LMS coordinates block access by passing messages between instances: when an instance requests a block, its LMD process issues a request for the block resource; that request goes only to the LMD process of the instance mastering the block, and the master's LMD asks the LMD of the instance currently using the resource to release it. The LMS process of the holding instance then builds a consistent-read image of the block and ships it to the buffer cache of the requesting instance. LMS thus ensures that only one instance at a time may update a block, and it maintains the block-image records (including the status flag of an updated block).
At the time of the fault, the process-state relationships show the LMS processes stuck on latch waits (latch: object queue header operation and latch: cache buffers chains) and in the dead state (except LMS0). In other words, block shipping between nodes 1 and 2 had broken down. Because the PMON and LMS processes were contending for the same latch (latch: object queue header operation), PMON could not acquire it, timed out for over ten minutes, and therefore could not clean up the failed processes.
The LMD (Global Enqueue Service Daemon) process manages access to global enqueues and resources, updates the corresponding queue states, and services resource requests from other instances. When node 2's LMS processes got into trouble, LMD could not update those queue states, and the logs show node 2 sitting in an idle wait (ges remote message). The LMON (Global Enqueue Service Monitor) process monitors the global enqueues and resources in the cluster, manages instances, handles failures, and performs recovery on the affected cluster queues. When node 1's LMON detected the anomaly on node 2, it sent the kill-member instruction, and node 2's LMD performed the abort.
We also note that at the time of the fault the CKPT and DBWR processes were dead, held in a pending-I/O state to preserve consistency; the DBWR wait event was gcs drm freeze in enter server mode, for which Oracle gives the following explanation (metalink doc: 4755405.8):
Under heavy load, DRM sync can take very long as LMON is waiting for LMS
processes to empty their send queues. This wait is unnecessary as
correctness is already guaranteed by the sync channel flush mechanism.
This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).
In other words, under heavy load, and especially when large volumes of consistent blocks must be shipped between nodes, DRM sync makes LMON wait for the LMS processes to drain their send queues. Draining those queues can itself trigger latch waits, which in turn cause GES problems; and when Oracle's GES (which comprises the LMON and LMD processes) detects a problem, it aborts the instance (metalink: ID 9920699.8). Oracle also notes that a DRM operation can take more than 5 minutes. Oracle has filed this behavior as bug 6960699.
3. Solution:
Disable the DRM feature by setting the following hidden parameters:
_gc_affinity_time=0
_gc_undo_affinity=FALSE
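As a sketch of how these parameters might be applied: only the two parameter names come from the text above; the scope/sid clauses and the need for a rolling restart of all instances are assumptions, so review against your environment before running anything.

```shell
# Write the DRM-disable commands to a script for review; underscore
# parameters must be double-quoted in ALTER SYSTEM.
cat > disable_drm.sql <<'EOF'
alter system set "_gc_affinity_time"=0 scope=spfile sid='*';
alter system set "_gc_undo_affinity"=false scope=spfile sid='*';
-- Assumption: a rolling restart of the RAC instances is needed to take effect.
EOF

# To apply (not run here): sqlplus / as sysdba @disable_drm.sql
cat disable_drm.sql
```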
Appendix:
The awk-formatted output makes latch 700000d1fc80b20 the prime suspect, so we search the trace file crmdb2_diag_3732132.trc for the related entries.
PROCESS 7:
----------------------------------------
SO: 700000d64f4f4c0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=7, calls cur/top: 700000d69842360/700000d6983c508, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3830616
OSD pid info: Unix process pid: 3830616, image: oracle@crmdb02 (LMS0)
The highlighted lines show that process=7 (LMS0) is holding the object queue header operation child latch (child#=77, addr=700000d1fc80b20), which leaves process=15 waiting.
Checking the state of PROCESS=15:
PROCESS 15:
----------------------------------------
SO: 700000d8cf24760, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=15, calls cur/top: 700000d698433f8/700000d6983db38, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
gotten 688460698 times wait, failed first 3268358 sleeps 1757249
gotten 19572 times nowait, failed: 109
possible holder pid = 7 ospid=3830616
on wait list for 700000d1fc80b20
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3260452 (DEAD)
OSD pid info: Unix process pid: 3260452, image: oracle@crmdb02 (LMS8)
The highlighted lines show that process=15 (LMS8) is waiting for the child latch (child#=77, addr=700000d1fc80b20) and that the process is already DEAD.
An OS-level check shows that only one session is waiting on this child latch:
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "waiting for 700000d1fc80b20" |wc -l
1
A further check shows nine sessions in total holding this child latch:
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "holding (efd=4) 700000d1fc80b20" |wc -l
9
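The two grep counts above can be combined into a single pass. A sketch using awk over a stand-in excerpt (on the real system the input would be crmdb2_diag_3732132.trc):

```shell
# Stand-in lines mimicking the latch entries in a systemstate dump.
cat > trc_sample.txt <<'EOF'
      holding    (efd=4) 700000d1fc80b20 Child object queue header operation
        waiting for 700000d1fc80b20 Child object queue header operation
      holding    (efd=4) 700000d1fc80b20 Child object queue header operation
EOF

# Tally holders and waiters of one latch address in a single pass.
awk '/holding .*700000d1fc80b20/ {h++}
     /waiting for 700000d1fc80b20/ {w++}
     END {printf "holders=%d waiters=%d\n", h, w}' trc_sample.txt
```

Against the real trace this prints the same 9-holders / 1-waiter result as the two grep pipelines, in one pass.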
Checking the state of PROCESS=8:
PROCESS 8:
----------------------------------------
SO: 700000d8cf23f70, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=8, calls cur/top: 700000d69842620/700000d6983c7c8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3503016 (DEAD)
OSD pid info: Unix process pid: 3503016, image: oracle@crmdb02 (LMS1)
Process=8 (LMS1) is holding the same child latch as process=7.
Checking the states of PROCESS=9 and PROCESS=20:
PROCESS 9:
----------------------------------------
SO: 700000d60f24a78, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=9, calls cur/top: 700000d698428e0/700000d6983caa0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 2982786 (DEAD)
OSD pid info: Unix process pid: 2982786, image: oracle@crmdb02 (LMS2)
Process=9 (LMS2) and process=20 (LMSd) are holding the child latch at addr=700000d1fe74fd0 (child#=91), which leaves process=17 (LMSa) and process=18 (LMSb) waiting.
Checking the state of PROCESS=17:
PROCESS 17:
----------------------------------------
SO: 700000d5cf23598, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 700000d67b9d3f8/700000d6983e0d0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 295048 (DEAD)
OSD pid info: Unix process pid: 295048, image: oracle@crmdb02 (LMSa)
Checking the state of PROCESS=18:
PROCESS 18:
----------------------------------------
SO: 700000d58f42690, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=18, calls cur/top: 700000d69843f10/700000d6983e390, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3428506
OSD pid info: Unix process pid: 3428506, image: oracle@crmdb02 (LMSb)
Processes 17 and 18 are both waiting on the child latch at addr=700000d1fe74fd0.
PROCESS=10 (LMS3), PROCESS=11 (LMS4), PROCESS=12 (LMS5), PROCESS=13 (LMS6), PROCESS=14 (LMS7), PROCESS=19 (LMSc), and PROCESS=21 (LMSe) are all in a similar state: each holds the object queue header operation child latch (child#=77, addr=700000d1fc80b20), which leaves process=15 waiting.
Checking the state of PROCESS=16:
PROCESS 16:
----------------------------------------
SO: 700000d60f25268, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=16, calls cur/top: 700000d69843990/700000d6983ddf8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclgrantlk_2:
waiting for 700000cdf26c988 Child cache buffers chains level=1 child#=162223
Location from where latch is held: kcbchg: kslbegin: bufs not pinned:
Context saved from call: 562086705
state=busy(exclusive) (val=0x200000000000086f) holder orapid = 2159
waiters [orapid (seconds since: put on list, posted, alive check)]:
16 (140, 1301277525, 2)
waiter count=1
gotten 4188007 times wait, failed first 98161 sleeps 554
gotten 461256 times nowait, failed: 124
possible holder pid = 2159 ospid=3392358
on wait list for 700000cdf26c988
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3560010 (DEAD)
Checking the state of process=22:
PROCESS 22:
----------------------------------------
SO: 700000d8cf24f50, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=22, calls cur/top: 700000d67b9dc38/700000d6983eea8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 135
last post sent-location: kclcget
last process posted by me: 700000d60f53498 3 1
(latch info) wait_event=0 bits=0
Location from where call was made: kclexpandlock:
waiting for 700000cea8a6a80 Child cache buffers chains level=1 child#=61810
Location from where latch is held: kcbzgb: scan from tail. nowait:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000456) holder orapid = 1110
waiters [orapid (seconds since: put on list, posted, alive check)]:
22 (279, 1301277525, 2)
waiter count=1
gotten 5121038 times wait, failed first 15184 sleeps 753
gotten 586404 times nowait, failed: 1427022
possible holder pid = 1110 ospid=2978810
on wait list for 700000cea8a6a80
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3109298 (DEAD)
OSD pid info: Unix process pid: 3109298, image: oracle@crmdb02 (LMSf)
Finally, checking the state of process=2:
PROCESS 2:
----------------------------------------
SO: 700000d60f24288, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 700000d6983b730/700000d6983b730, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403214633274144 133 2
last post received-location: kslges
last process to post me: 700000d5cf85938 137 1
last post sent: 0 0 121
last post sent-location: kcbzww
last process posted by me: 700000d58f7f6e0 198 1
(latch info) wait_event=0 bits=6
Location from where call was made: kcbw_unlink_q:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
holding (efd=8) 700000cf3a6e170 Child cache buffers lru chain level=2 child#=91
Location from where latch is held: kcbzfw: buffer header
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
63 (50, 1301277525, 2)
waiter count=1
holding (efd=8) 700000ce2a79ef0 Child cache buffers chains level=1 child#=81019
Location from where latch is held: kcbso1: set no access:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 1274566
OSD pid info: Unix process pid: 1274566, image: oracle@crmdb02 (PMON)
一、现象回顾:
2号节点发生故障时,alert日志显示如下:
Thread 2 advanced to log sequence 77740 (LGWR switch)
Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2G
Current log# 24 seq# 77740 mem# 1: /dev/rcrm4_rd2_92_2G
Mon Mar 28 09:45:23 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:46:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:53 BEIST 2011
Thread 2 advanced to log sequence 77741 (LGWR switch)
Current log# 25 seq# 77741 mem# 0: /dev/rcrm5_rd2_101_2G
Current log# 25 seq# 77741 mem# 1: /dev/rcrm5_rd2_102_2G
Mon Mar 28 09:48:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:49:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:50:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:51:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:52:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:53:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:54:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:55:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:56:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:57:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:58:44 BEIST 2011
Received an instance abort message from instance 1 (reason 0x0)
Please check instance 1 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481
Mon Mar 28 09:58:45 BEIST 2011
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/crmdb/bdump/crmdb2_diag_3732132.trc
Mon Mar 28 09:58:45 BEIST 2011
Shutting down instance (abort)
License high water mark = 2290
从alert日志来看,二号节点主要表现为PMON进程不能获得latch(PMON failed to acquire latch),超时等待10分钟,之后接受到一号节点将二号节点宕机指令(Received an instance abort message from instance 1 (reason 0x0)),LMD0进程将实例abort(LMD0: terminating instance due to error 481),并在宕机之前系统自动做了systemstate dump(System state dump is made for local instance),宕机之前用户在2节点的会话数为2290个(License high water mark = 2290)。
采用ass.awk工具分析crmdb2_diag_3732132.trc可以得到如下结果:
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.9 - Processing crmdb2_diag_3732132.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
3: last wait for 'DIAG idle wait'
4: waiting for 'rdbms ipc message'
5: last wait for 'ges lmd/lmses to freeze in rcfg - mrcvr'
6: last wait for 'ges remote message'
7: waiting for 'gc object scan'
8: last wait for 'gc object scan'
9: last wait for 'gc object scan'
10: last wait for 'gc object scan'
11: waiting for 'gc object scan'
12: last wait for 'gc object scan'
13: last wait for 'gc object scan'
14: last wait for 'gc object scan'
15: waiting for 'latch: object queue header operation'[Latch 700000d1fc80b20]
16: waiting for 'latch: cache buffers chains'[Latch 700000cdf26c988]
17: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
18: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
19: last wait for 'gc object scan'
20: last wait for 'gc object scan'
21: last wait for 'gc object scan'
22: waiting for 'latch: cache buffers chains'[Latch 700000cea8a6a80]
23: waiting for 'rdbms ipc message'
24: waiting for 'gcs drm freeze in enter server mode'
25: waiting for 'gcs drm freeze in enter server mode'
26: waiting for 'gcs drm freeze in enter server mode'
27: waiting for 'gcs drm freeze in enter server mode'
28: waiting for 'gcs drm freeze in enter server mode'
29: waiting for 'gcs drm freeze in enter server mode'
30: waiting for 'gcs drm freeze in enter server mode'
31: waiting for 'gcs drm freeze in enter server mode'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: last wait for 'smon timer'
35: waiting for 'rdbms ipc message'
36: waiting for 'rdbms ipc message'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39:
40:
41: waiting for 'rdbms ipc message'
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 700000d1fe74fd0 ??? Blocker
Latch 700000d1fc80b20 ??? Blocker
Latch 700000cdf26c988 ??? Blocker
Latch 700000cea8a6a80 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 700000d1fe74fd0 Child object queue header operation
Latch 700000d1fc80b20 holding (efd=4) 700000d1fc80b20 Child obj
Latch 700000cdf26c988 Child cache buffers chains
Latch 700000cea8a6a80 Child cache buffers chains
762178 Lines Processed.
进一步通过分析crmdb2_diag_3732132.trc文件(分析过程详见附录),进程之间有如下关系:
(1)当故障发生时,orapid=7占用着latch: object queue header operation,addr=700000d1fc80b20,orapid=8,10,11,12,13,14,15,19,21处于等待状态,且会话均为dead状态。
(2)当故障发生时,orapid=9,20占用着latch: object queue header operation,addr=700000d1fe74fd0,orapid=2,17,18处于等待状态,其中2进程为alive,17,18进程均为dead状态。
(3)当故障发生时orapid=16等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态
(4)当故障发生时orapid=12等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态。
(5)值得注意的是故障发生时dbwr进程均处于dead状态,等待事件如下:
Dumping Session Wait History
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488295 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488293 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488291 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
故障发生时各日志文件显示如下:
(1)1号节点日志信息:
crmdb1_lmon_4579776.trc显示,当1号节点LMON进程检测LMD进程到存在异常时(超时900秒),即向2号节点发起请求(requesting memberkill of instances w/o ftdones),2号节点收到之后由LMD完成宕库操作。
*** 2011-03-28 09:58:43.010
* kjfcdrmrcfg: waited 901 secs (6501391,6502292,900) for lmd to receive all ftdones, requesting memberkill of instances w/o ftdones:
kjfsprn: sync status inst 0 tmout 900 (sec)
kjfsprn: sync propose inc 4 level 691378
kjfsprn: dmap ver 4 (step 0)
kjfsprn: sync inc 4 level 691378
kjfsprn: sync bitmap 0 1
DEFER MSG QUEUE ON LMD0 IS EMPTY
SEQUENCES:
0:0.0 1:210167409.0
alert_crmdb1.log显示:
* kjfc_kill_no_sync: send abort msg to node 1 (noftdn)
Mon Mar 28 09:58:45 BEIST 2011
(2)2号节点信息:
crmdb2_lmd0_2154764.trc显示:2号节点lmd进程收到指令,准备宕节点,宕节点之前(5秒钟)使用diag进程进行systemstate dump。
*** 2011-03-28 09:58:44.845
Received a special admin message (type 2) from node 0
Message body: flag 0x5 data 0x0 0x0 0x0
Abort the instance
ksuitm: waiting up to [5] seconds before killing DIAG(3732132)
二、故障分析
LMS(GLOBAL CACHE SERVICE PROCESS)进程主要用来管理集群内数据块的访问,并在不同实例的BUFFER CACHE中传输块镜像。LMS进程跨集群管理数据库的请求,并保证在所有实例的BUFFER CACHE中一个数据块的镜像只能出现一次。LMS进程靠着在实例中传递消息来协调数据块的访问,当一个实例请求数据块时,该实例的LMD进程发出一个数据块资源的请求,该请求只向MASTER数据块的实例的LMD进程,MASTER实例的LMD进程同时正在使用的实例的LMD进程释放该资源,这时拥有该资源的实例的LMS进程会创建一个数据块镜像的一致性读,然后把该数据块传递到请求该资源的实例的BUFFER CACHE中。LMS进程保证了在每一时刻只能允许一个实例去更新数据块,并负责保持该数据块的镜像纪录(包含更新数据块的状态FLAG)。在故障发生时,由进程状态关系图,可以看到LMS进程处于latch free(latch: object queue header operation和latch: cache buffers chains)且状态为dead状态(LMS0除外)。也就意味着1,2号节点之间block传输出现了问题。由于PMON进程和LMS进程相互争用latch(latch: object queue header operation),进而PMON获取不了latch,出现超时等待10分钟,也就无法恢复有问题的进程。LMD (GLOBAL ENQUEUE SERVICE DAEMON)进程主要管理对全局队列和资源的访问,并更新相应队列的状态,处理来自于其他实例的资源请求。当2号节点LMS进程出现问题时,无法更新相应队列状态,由日志可以看出2号节点处于空闲事件等待状态(ges remote message)LMON(GLOBAL ENQUEUE SERVICE MONITOR)进程主要监测群集内的全局队列和全局资源,管理实例和处理异常并相应的群集队列进行恢复操作。当1号节点LMON进程检测到2号节点出现异常时,发送killmember指令,由2号节点lmd进程完成abort操作。我们同时注意到故障发生时,ckpt,dbwr进程均处于dead状态,为保证一致性处于pending i/o状态,dbwr等待事件为gcs drm freeze in enter server mode,该等待事件在Oracle有如下解释(metalink doc:4755405.8):
Under heavy load, DRM sync can take very long as LMON is waiting for LMS
processes to empty their send queues. This wait is unnecessary as
correctness is already guaranteed by the sync channel flush mechanism.
This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).
也就是在在高负载到情况下,尤其是需要节点之间大规模传递一致性块时,DRM sync功能LMON会等待 LMS进程清空发送队列,当然在LMS进程清空发送队列时可能会引起latch free等待,进而引起GES问题,而Oracle GES(GES包括LMON和LMD进程)检测到有问题时,将会abort实例(metalink:ID 9920699.8)。Oracle还指出DRM功能可能会超过5分钟。针对这一现象,Oracle定位为bug:6960699
三、解决办法:
屏蔽DRM功能,设置如下隐含参数
_gc_affinity_time=0
_gc_undo_affinity=FALSE
附录:
经过awk格式化之后的文件可以看出latch 700000d1fc80b20成为首号怀疑对象,进而在跟踪文件crmdb2_diag_3732132.trc寻找相关信息。
PROCESS 7:
----------------------------------------
SO: 700000d64f4f4c0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=7, calls cur/top: 700000d69842360/700000d6983c508, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3830616
OSD pid info: Unix process pid: 3830616, image: oracle@crmdb02 (LMS0)
从红色标注可以看出process=7(LMS0)进程holding child latch#=7,addr=700000d1fc80b20的object queue header operation锁,从而导致了process=15进程等待。
查看PROCESS=15状态:
PROCESS 15:
----------------------------------------
SO: 700000d8cf24760, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=15, calls cur/top: 700000d698433f8/700000d6983db38, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
gotten 688460698 times wait, failed first 3268358 sleeps 1757249
gotten 19572 times nowait, failed: 109
possible holder pid = 7 ospid=3830616
on wait list for 700000d1fc80b20
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3260452 (DEAD)
OSD pid info: Unix process pid: 3260452, image: oracle@crmdb02 (LMS8)
The highlighted lines show that process=15 (LMS8) is waiting for the same child latch (child#=77, addr=700000d1fc80b20), and that its OS process is already DEAD.
Checking with OS commands shows that only one session is waiting for this child latch:
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "waiting for 700000d1fc80b20" |wc -l
1
A further check shows a total of nine sessions holding this child latch:
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "holding (efd=4) 700000d1fc80b20" |wc -l
9
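The two greps above generalize into a small reusable snippet that reports holder and waiter counts for any latch address. Again, the trace file here is a synthetic sample, not the real crmdb2_diag_3732132.trc:

```shell
# LATCH is the address under investigation; TRACE the diag trace file.
LATCH=700000d1fc80b20
TRACE=sample_diag.trc

# Synthetic sample trace: two holders and one waiter of the latch.
cat > "$TRACE" <<'EOF'
holding (efd=4) 700000d1fc80b20 Child object queue header operation
waiting for 700000d1fc80b20 Child object queue header operation
holding (efd=4) 700000d1fc80b20 Child object queue header operation
EOF

holders=$(grep -c "holding (efd=4) $LATCH" "$TRACE")
waiters=$(grep -c "waiting for $LATCH" "$TRACE")
echo "holders=$holders waiters=$waiters"
```

A shared latch (state=busy(shared), as in the dumps above) legitimately has several concurrent holders, which is why the holder count can exceed one.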
Check the state of PROCESS=8:
PROCESS 8:
----------------------------------------
SO: 700000d8cf23f70, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=8, calls cur/top: 700000d69842620/700000d6983c7c8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3503016 (DEAD)
OSD pid info: Unix process pid: 3503016, image: oracle@crmdb02 (LMS1)
Process=8 (LMS1) is likewise holding the child latch at addr=700000d1fc80b20 and, like process=15, its OS process is already DEAD.
Check the states of PROCESS=9 and PROCESS=20:
PROCESS 9:
----------------------------------------
SO: 700000d60f24a78, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=9, calls cur/top: 700000d698428e0/700000d6983caa0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 2982786 (DEAD)
OSD pid info: Unix process pid: 2982786, image: oracle@crmdb02 (LMS2)
Process=9 (LMS2) and process=20 (LMSd) are holding the child latch at addr=700000d1fe74fd0 (child#=91), which in turn blocks process=17 (LMSa) and process=18 (LMSb).
Check the state of PROCESS=17:
PROCESS 17:
----------------------------------------
SO: 700000d5cf23598, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 700000d67b9d3f8/700000d6983e0d0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 295048 (DEAD)
OSD pid info: Unix process pid: 295048, image: oracle@crmdb02 (LMSa)
Check the state of PROCESS=18:
PROCESS 18:
----------------------------------------
SO: 700000d58f42690, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=18, calls cur/top: 700000d69843f10/700000d6983e390, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3428506
OSD pid info: Unix process pid: 3428506, image: oracle@crmdb02 (LMSb)
Both process=17 and process=18 are waiting for the child latch at addr=700000d1fe74fd0.
PROCESS=10 (LMS3), PROCESS=11 (LMS4), PROCESS=12 (LMS5), PROCESS=13 (LMS6), PROCESS=14 (LMS7), PROCESS=19 (LMSc), and PROCESS=21 (LMSe) are all in a similar state: each is holding the object queue header operation child latch (child#=77, addr=700000d1fc80b20) that process=15 is waiting for.
Check the state of PROCESS=16:
PROCESS 16:
----------------------------------------
SO: 700000d60f25268, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=16, calls cur/top: 700000d69843990/700000d6983ddf8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclgrantlk_2:
waiting for 700000cdf26c988 Child cache buffers chains level=1 child#=162223
Location from where latch is held: kcbchg: kslbegin: bufs not pinned:
Context saved from call: 562086705
state=busy(exclusive) (val=0x200000000000086f) holder orapid = 2159
waiters [orapid (seconds since: put on list, posted, alive check)]:
16 (140, 1301277525, 2)
waiter count=1
gotten 4188007 times wait, failed first 98161 sleeps 554
gotten 461256 times nowait, failed: 124
possible holder pid = 2159 ospid=3392358
on wait list for 700000cdf26c988
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3560010 (DEAD)
Check the state of PROCESS=22:
PROCESS 22:
----------------------------------------
SO: 700000d8cf24f50, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=22, calls cur/top: 700000d67b9dc38/700000d6983eea8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 135
last post sent-location: kclcget
last process posted by me: 700000d60f53498 3 1
(latch info) wait_event=0 bits=0
Location from where call was made: kclexpandlock:
waiting for 700000cea8a6a80 Child cache buffers chains level=1 child#=61810
Location from where latch is held: kcbzgb: scan from tail. nowait:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000456) holder orapid = 1110
waiters [orapid (seconds since: put on list, posted, alive check)]:
22 (279, 1301277525, 2)
waiter count=1
gotten 5121038 times wait, failed first 15184 sleeps 753
gotten 586404 times nowait, failed: 1427022
possible holder pid = 1110 ospid=2978810
on wait list for 700000cea8a6a80
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3109298 (DEAD)
OSD pid info: Unix process pid: 3109298, image: oracle@crmdb02 (LMSf)
Finally, check the state of PROCESS=2 (PMON):
PROCESS 2:
----------------------------------------
SO: 700000d60f24288, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 700000d6983b730/700000d6983b730, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403214633274144 133 2
last post received-location: kslges
last process to post me: 700000d5cf85938 137 1
last post sent: 0 0 121
last post sent-location: kcbzww
last process posted by me: 700000d58f7f6e0 198 1
(latch info) wait_event=0 bits=6
Location from where call was made: kcbw_unlink_q:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
holding (efd=8) 700000cf3a6e170 Child cache buffers lru chain level=2 child#=91
Location from where latch is held: kcbzfw: buffer header
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
63 (50, 1301277525, 2)
waiter count=1
holding (efd=8) 700000ce2a79ef0 Child cache buffers chains level=1 child#=81019
Location from where latch is held: kcbso1: set no access:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 1274566
OSD pid info: Unix process pid: 1274566, image: oracle@crmdb02 (PMON)
This dump ties the picture together: PMON (process=2) is itself waiting for the object queue header operation child latch at addr=700000d1fe74fd0 (possible holder pid=9, the dead LMS2) while holding cache buffers lru chain and cache buffers chains child latches of its own. That is exactly the "PMON failed to acquire latch" condition logged repeatedly in the alert log before LMD0 terminated the instance with error 481.