`
itspace
  • 浏览: 981983 次
  • 性别: Icon_minigender_1
  • 来自: 杭州
社区版块
存档分类
最新评论

Oracle rac宕机分析报告

阅读更多
某客户rac数据库2号节点实例自动宕节点,以下为分析报告

一、现象回顾:
2号节点发生故障时,alert日志显示如下:
Thread 2 advanced to log sequence 77740 (LGWR switch)
  Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2G
  Current log# 24 seq# 77740 mem# 1: /dev/rcrm4_rd2_92_2G
Mon Mar 28 09:45:23 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:46:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:53 BEIST 2011
Thread 2 advanced to log sequence 77741 (LGWR switch)
  Current log# 25 seq# 77741 mem# 0: /dev/rcrm5_rd2_101_2G
  Current log# 25 seq# 77741 mem# 1: /dev/rcrm5_rd2_102_2G
Mon Mar 28 09:48:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:49:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:50:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:51:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:52:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:53:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:54:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:55:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:56:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:57:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:58:44 BEIST 2011
Received an instance abort message from instance 1 (reason 0x0)
Please check instance 1 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481
Mon Mar 28 09:58:45 BEIST 2011
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/crmdb/bdump/crmdb2_diag_3732132.trc
Mon Mar 28 09:58:45 BEIST 2011
Shutting down instance (abort)
License high water mark = 2290


从alert日志来看,二号节点主要表现为PMON进程不能获得latch(PMON failed to acquire latch),超时等待10分钟,之后接受到一号节点将二号节点宕机指令(Received an instance abort message from instance 1 (reason 0x0)),LMD0进程将实例abort(LMD0: terminating instance due to error 481),并在宕机之前系统自动做了systemstate dump(System state dump is made for local instance),宕机之前用户在2节点的会话数为2290个(License high water mark = 2290)。

采用ass.awk工具分析crmdb2_diag_3732132.trc可以得到如下结果:
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.9 - Processing crmdb2_diag_3732132.trc

System State 1
~~~~~~~~~~~~~~~~
1:                                     
2:  waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
3:  last wait for 'DIAG idle wait'     
4:  waiting for 'rdbms ipc message'    
5:  last wait for 'ges lmd/lmses to freeze in rcfg - mrcvr'
6:  last wait for 'ges remote message' 
7:  waiting for 'gc object scan'       
8:  last wait for 'gc object scan'     
9:  last wait for 'gc object scan'     
10: last wait for 'gc object scan'     
11: waiting for 'gc object scan'       
12: last wait for 'gc object scan'     
13: last wait for 'gc object scan'     
14: last wait for 'gc object scan'     
15: waiting for 'latch: object queue header operation'[Latch 700000d1fc80b20]
16: waiting for 'latch: cache buffers chains'[Latch 700000cdf26c988]
17: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
18: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
19: last wait for 'gc object scan'     
20: last wait for 'gc object scan'     
21: last wait for 'gc object scan'     
22: waiting for 'latch: cache buffers chains'[Latch 700000cea8a6a80]
23: waiting for 'rdbms ipc message'    
24: waiting for 'gcs drm freeze in enter server mode'
25: waiting for 'gcs drm freeze in enter server mode'
26: waiting for 'gcs drm freeze in enter server mode'
27: waiting for 'gcs drm freeze in enter server mode'
28: waiting for 'gcs drm freeze in enter server mode'
29: waiting for 'gcs drm freeze in enter server mode'
30: waiting for 'gcs drm freeze in enter server mode'
31: waiting for 'gcs drm freeze in enter server mode'
32: waiting for 'rdbms ipc message'    
33: waiting for 'rdbms ipc message'    
34: last wait for 'smon timer'         
35: waiting for 'rdbms ipc message'    
36: waiting for 'rdbms ipc message'    
37: waiting for 'rdbms ipc message'    
38: waiting for 'rdbms ipc message'    
39:                                    
40:                                    
41: waiting for 'rdbms ipc message'    
Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
   systemstate.

                    Resource Holder State
       Latch 700000d1fe74fd0    ??? Blocker
       Latch 700000d1fc80b20    ??? Blocker
       Latch 700000cdf26c988    ??? Blocker
       Latch 700000cea8a6a80    ??? Blocker

Object Names
~~~~~~~~~~~~
Latch 700000d1fe74fd0 Child object queue header operation
Latch 700000d1fc80b20       holding    (efd=4) 700000d1fc80b20 Child obj
Latch 700000cdf26c988 Child cache buffers chains   
Latch 700000cea8a6a80 Child cache buffers chains   


762178 Lines Processed.


进一步通过分析crmdb2_diag_3732132.trc文件(分析过程详见附录),进程之间有如下关系:
(1)当故障发生时,orapid=7占用着latch: object queue header operation,addr=700000d1fc80b20,orapid=8,10,11,12,13,14,15,19,21处于等待状态,且会话均为dead状态。



(2)当故障发生时,orapid=9,20占用着latch: object queue header operation,addr=700000d1fe74fd0,orapid=2,17,18处于等待状态,其中2进程为alive,17,18进程均为dead状态。



(3)当故障发生时orapid=16等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态




(4)当故障发生时orapid=12等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态。





(5)值得注意的是故障发生时dbwr进程均处于dead状态,等待事件如下:
Dumping Session Wait History
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488295 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488293 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488291 sec
            =1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec

故障发生时各日志文件显示如下:
(1)1号节点日志信息:
crmdb1_lmon_4579776.trc显示,当1号节点LMON进程检测LMD进程到存在异常时(超时900秒),即向2号节点发起请求(requesting memberkill of instances w/o ftdones),2号节点收到之后由LMD完成宕库操作。
*** 2011-03-28 09:58:43.010
* kjfcdrmrcfg: waited 901 secs (6501391,6502292,900) for lmd to receive all ftdones, requesting memberkill of instances w/o ftdones:
kjfsprn: sync status  inst 0  tmout 900 (sec)
kjfsprn: sync propose inc 4  level 691378
kjfsprn: dmap ver 4 (step 0)
kjfsprn: sync inc 4  level 691378
kjfsprn: sync bitmap 0 1
DEFER MSG QUEUE ON LMD0 IS EMPTY
SEQUENCES:
  0:0.0  1:210167409.0


alert_crmdb1.log显示:
* kjfc_kill_no_sync: send abort msg to node 1 (noftdn)
Mon Mar 28 09:58:45 BEIST 2011


(2)2号节点信息:
crmdb2_lmd0_2154764.trc显示:2号节点lmd进程收到指令,准备宕节点,宕节点之前(5秒钟)使用diag进程进行systemstate dump。
*** 2011-03-28 09:58:44.845
Received a special admin message (type 2) from node 0
Message body: flag 0x5 data 0x0 0x0 0x0
Abort the instance
ksuitm: waiting up to [5] seconds before killing DIAG(3732132)



二、故障分析
LMS(GLOBAL CACHE SERVICE PROCESS)进程主要用来管理集群内数据块的访问,并在不同实例的BUFFER CACHE中传输块镜像。LMS进程跨集群管理数据库的请求,并保证在所有实例的BUFFER CACHE中一个数据块的镜像只能出现一次。LMS进程靠着在实例中传递消息来协调数据块的访问,当一个实例请求数据块时,该实例的LMD进程发出一个数据块资源的请求,该请求只向MASTER数据块的实例的LMD进程,MASTER实例的LMD进程同时正在使用的实例的LMD进程释放该资源,这时拥有该资源的实例的LMS进程会创建一个数据块镜像的一致性读,然后把该数据块传递到请求该资源的实例的BUFFER CACHE中。LMS进程保证了在每一时刻只能允许一个实例去更新数据块,并负责保持该数据块的镜像纪录(包含更新数据块的状态FLAG)。在故障发生时,由进程状态关系图,可以看到LMS进程处于latch free(latch: object queue header operation和latch: cache buffers chains)且状态为dead状态(LMS0除外)。也就意味着1,2号节点之间block传输出现了问题。由于PMON进程和LMS进程相互争用latch(latch: object queue header operation),进而PMON获取不了latch,出现超时等待10分钟,也就无法恢复有问题的进程。LMD (GLOBAL ENQUEUE SERVICE DAEMON)进程主要管理对全局队列和资源的访问,并更新相应队列的状态,处理来自于其他实例的资源请求。当2号节点LMS进程出现问题时,无法更新相应队列状态,由日志可以看出2号节点处于空闲事件等待状态(ges remote message)LMON(GLOBAL ENQUEUE SERVICE MONITOR)进程主要监测群集内的全局队列和全局资源,管理实例和处理异常并相应的群集队列进行恢复操作。当1号节点LMON进程检测到2号节点出现异常时,发送killmember指令,由2号节点lmd进程完成abort操作。我们同时注意到故障发生时,ckpt,dbwr进程均处于dead状态,为保证一致性处于pending i/o状态,dbwr等待事件为gcs drm freeze in enter server mode,该等待事件在Oracle有如下解释(metalink doc:4755405.8):
Under heavy load, DRM sync can take very long as LMON is waiting for LMS
processes to empty their send queues. This wait is unnecessary as
correctness is already guaranteed by the sync channel flush mechanism.
This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).
也就是在在高负载到情况下,尤其是需要节点之间大规模传递一致性块时,DRM sync功能LMON会等待 LMS进程清空发送队列,当然在LMS进程清空发送队列时可能会引起latch free等待,进而引起GES问题,而Oracle GES(GES包括LMON和LMD进程)检测到有问题时,将会abort实例(metalink:ID 9920699.8)。Oracle还指出DRM功能可能会超过5分钟。针对这一现象,Oracle定位为bug:6960699
三、解决办法:
屏蔽DRM功能,设置如下隐含参数
_gc_affinity_time=0
_gc_undo_affinity=FALSE


附录:
经过awk格式化之后的文件可以看出latch 700000d1fc80b20成为首号怀疑对象,进而在跟踪文件crmdb2_diag_3732132.trc寻找相关信息。
PROCESS 7:
  ----------------------------------------
  SO: 700000d64f4f4c0, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=7, calls cur/top: 700000d69842360/700000d6983c508, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=20
      holding    (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000009)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           15 (2, 1301277525, 2)
           waiter count=1
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3830616
    OSD pid info: Unix process pid: 3830616, image: oracle@crmdb02 (LMS0)
从红色标注可以看出process=7(LMS0)进程holding child latch#=7,addr=700000d1fc80b20的object queue header operation锁,从而导致了process=15进程等待。

查看PROCESS=15状态:
PROCESS 15:
  ----------------------------------------
  SO: 700000d8cf24760, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=15, calls cur/top: 700000d698433f8/700000d6983db38, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: kclobj_2:
      waiting for 700000d1fc80b20 Child object queue header operation level=5 child#=77
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000009)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           15 (2, 1301277525, 2)
           waiter count=1
          gotten 688460698 times wait, failed first 3268358 sleeps 1757249
          gotten 19572 times nowait, failed: 109
        possible holder pid = 7 ospid=3830616
      on wait list for 700000d1fc80b20
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3260452 (DEAD)
    OSD pid info: Unix process pid: 3260452, image: oracle@crmdb02 (LMS8)
从红色标注可以看出process=15(LMS8)进程正在等待child latch#=7,addr=700000d1fc80b20,且进程已处于DEAD状态。

用操作系统命令查看,发现只有一个会话等待此child latch
[ora10g@mcprod ~]$ cat  crmdb2_diag_3732132.trc|grep "waiting for 700000d1fc80b20" |wc -l
1

进一步查看共有9个会话holding此child latch。
[ora10g@mcprod ~]$ cat  crmdb2_diag_3732132.trc|grep "holding    (efd=4) 700000d1fc80b20" |wc -l
9

查看PROCESS=8状态。
PROCESS 8:
  ----------------------------------------
  SO: 700000d8cf23f70, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=8, calls cur/top: 700000d69842620/700000d6983c7c8, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=20
      holding    (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000009)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           15 (2, 1301277525, 2)
           waiter count=1
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3503016 (DEAD)
    OSD pid info: Unix process pid: 3503016, image: oracle@crmdb02 (LMS1)
可以看出process=8(LMS1)状态等同于process=15。

查看PROCESS=9,PROCESS=20状态
PROCESS 9:
  ----------------------------------------
  SO: 700000d60f24a78, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=9, calls cur/top: 700000d698428e0/700000d6983caa0, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=20
      holding    (efd=4) 700000d1fe74fd0 Child object queue header operation level=5 child#=91
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000002)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           18 (2, 1301277525, 2)
           17 (2, 1301277525, 2)
           waiter count=2
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 2982786 (DEAD)
    OSD pid info: Unix process pid: 2982786, image: oracle@crmdb02 (LMS2)
可以看到process=9(LMS2), process=20(LMSd),会话holding了addr=700000d1fe74fd0 child#91的child latch,进而导致了process=17(LMSa)和process=18(LMSb)等待。

查看PROCESS=17状态
PROCESS 17:
  ----------------------------------------
  SO: 700000d5cf23598, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=17, calls cur/top: 700000d67b9d3f8/700000d6983e0d0, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: kclobj_2:
      waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000002)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           18 (2, 1301277525, 2)
           17 (2, 1301277525, 2)
           waiter count=2
          gotten 687558442 times wait, failed first 3419261 sleeps 1918111
          gotten 19762 times nowait, failed: 151
        possible holder pid = 9 ospid=2982786
      on wait list for 700000d1fe74fd0
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 295048 (DEAD)
    OSD pid info: Unix process pid: 295048, image: oracle@crmdb02 (LMSa)

查看PROCESS=18状态
PROCESS 18:
  ----------------------------------------
  SO: 700000d58f42690, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=18, calls cur/top: 700000d69843f10/700000d6983e390, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: kclobj_2:
      waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000002)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           18 (2, 1301277525, 2)
           17 (2, 1301277525, 2)
           waiter count=2
          gotten 687558442 times wait, failed first 3419261 sleeps 1918111
          gotten 19762 times nowait, failed: 151
        possible holder pid = 9 ospid=2982786
      on wait list for 700000d1fe74fd0
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3428506
    OSD pid info: Unix process pid: 3428506, image: oracle@crmdb02 (LMSb)
可以看到process=17,18均等待addr=700000d1fe74fd0 child latch。

PROCESS=10(LMS3),PROCESS=11(LMS4),PROCESS=12(LMS5),PORCESS=13(LMS6),PROCESS=14(LMS7),PROCESS=15(LMS8),PROCESS=19(LMSc),PROCESS=21(LMSe)状态均为类似状态,即进程holding child latch#=7,addr=700000d1fc80b20的object queue header operation锁,从而导致了process=15进程等待。

查看PROCESS=16状态
PROCESS 16:
  ----------------------------------------
  SO: 700000d60f25268, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=16, calls cur/top: 700000d69843990/700000d6983ddf8, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 98
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 700000d58f43670 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: kclgrantlk_2:
      waiting for 700000cdf26c988 Child cache buffers chains level=1 child#=162223
        Location from where latch is held: kcbchg: kslbegin: bufs not pinned:
        Context saved from call: 562086705
        state=busy(exclusive) (val=0x200000000000086f) holder orapid = 2159
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           16 (140, 1301277525, 2)
           waiter count=1
          gotten 4188007 times wait, failed first 98161 sleeps 554
          gotten 461256 times nowait, failed: 124
        possible holder pid = 2159 ospid=3392358
      on wait list for 700000cdf26c988
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3560010 (DEAD)

查看process=22状态
PROCESS 22:
  ----------------------------------------
  SO: 700000d8cf24f50, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=22, calls cur/top: 700000d67b9dc38/700000d6983eea8, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 94
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 700000d58f43670 1 6
              last post sent: 0 0 135
              last post sent-location: kclcget
              last process posted by me: 700000d60f53498 3 1
    (latch info) wait_event=0 bits=0
        Location from where call was made: kclexpandlock:
      waiting for 700000cea8a6a80 Child cache buffers chains level=1 child#=61810
        Location from where latch is held: kcbzgb: scan from tail. nowait:
        Context saved from call: 0
        state=busy(exclusive) (val=0x2000000000000456) holder orapid = 1110
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           22 (279, 1301277525, 2)
           waiter count=1
          gotten 5121038 times wait, failed first 15184 sleeps 753
          gotten 586404 times nowait, failed: 1427022
        possible holder pid = 1110 ospid=2978810
      on wait list for 700000cea8a6a80
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 3109298 (DEAD)
    OSD pid info: Unix process pid: 3109298, image: oracle@crmdb02 (LMSf)

最后检查process=2状态
PROCESS 2:
  ----------------------------------------
  SO: 700000d60f24288, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=2, calls cur/top: 700000d6983b730/700000d6983b730, flag: (e) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 504403214633274144 133 2
              last post received-location: kslges
              last process to post me: 700000d5cf85938 137 1
              last post sent: 0 0 121
              last post sent-location: kcbzww
              last process posted by me: 700000d58f7f6e0 198 1
    (latch info) wait_event=0 bits=6
        Location from where call was made: kcbw_unlink_q:
      waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
        Location from where latch is held: kclobj_2:
        Context saved from call: 0
        state=busy(shared) (val=0x4000000000000002)
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           18 (2, 1301277525, 2)
           17 (2, 1301277525, 2)
           waiter count=2
          gotten 687558442 times wait, failed first 3419261 sleeps 1918111
          gotten 19762 times nowait, failed: 151
        possible holder pid = 9 ospid=2982786
      holding    (efd=8) 700000cf3a6e170 Child cache buffers lru chain level=2 child#=91
        Location from where latch is held: kcbzfw: buffer header
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           63 (50, 1301277525, 2)
           waiter count=1
      holding    (efd=8) 700000ce2a79ef0 Child cache buffers chains level=1 child#=81019
        Location from where latch is held: kcbso1: set no access:
        Context saved from call: 0
        state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
    Process Group: DEFAULT, pseudo proc: 700000d6919aea0
    O/S info: user: oracle, term: UNKNOWN, ospid: 1274566
    OSD pid info: Unix process pid: 1274566, image: oracle@crmdb02 (PMON)



  • 大小: 17.7 KB
  • 大小: 18.7 KB
  • 大小: 12.8 KB
  • 大小: 12.3 KB
分享到:
评论

相关推荐

    由重启引起的Oracle RAC节点宕机分析及追根溯源.docx

    ### 由重启引起的Oracle RAC节点宕机分析及追根溯源 #### 1. 背景介绍 本文档详细探讨了一个省级电信业务系统中,因频繁重启导致Oracle Real Application Clusters (RAC) 节点出现故障的情况。该系统在四套配置...

    RAC节点宕机故障分析

    RAC 节点宕机故障分析是指在 Oracle Real Application Clusters(RAC)环境中,节点宕机故障的诊断和解决方法。在这个主题中,我们将重点介绍 RAC 节点宕机故障的分析和解决方法,涵盖 ALERT 告警文件分析、事件分析...

    Oracle RAC故障节点删除步骤

    在 RAC-1 节点宕机,RAC-2 节点正常的情况下,需要在 RAC-2 节点上删除 RAC-1 节点信息。删除步骤如下: 1. 查看 RAC-2 节点数据库状态,确保一切正常。 2. 在 RAC-2 节点上,检查 pinned 状态,如果显示 unpinned...

    Oracle RAC安装和配置及ASM最佳实践

    当一个节点与集群中的其他节点失去联系超过Misscount设置的时间时,集群将认为该节点已经宕机。 1. **默认值**:默认情况下,Misscount的值为5秒。 2. **应用场景**:在高延迟网络环境中,可能需要适当增加...

    OracleRAC在VMWARE双机模式下测试报告

    ### OracleRAC在VMWARE双机模式下的测试分析 #### 一、测试背景与目标 在当前企业级应用环境中,Oracle RAC (Real Application Clusters) 技术因其高可用性和可扩展性而受到广泛欢迎。然而,在物理服务器环境下...

    Oracle-RAC-高可用-性能-上线前测试报告

    oracle RAC 在IBM小型机上的测试报告,双节点RAC的各项测试内容,单双节点的RAC性能测试内容,三节点RAC的RAC性能测试内容,RMAN测试,exp,imp测试,大批量呼叫数据测试,大批量查询测试,大事物测试,load balance...

    大话Oracle_RAC

    每个节点通过心跳机制检测彼此的状态,当心跳中断,节点可能会错误地认为其他节点已宕机,从而尝试接管整个集群。为解决这个问题,Oracle RAC采用了投票算法(Quorum Algorithm)。通过计算每个节点的票数(每个节点...

    Oracle rac入门和提高.docx

    相比于传统的双机热备,RAC能够在几秒钟内完成故障切换,显著降低了宕机时间。此外,RAC的可扩展性使得用户可以根据业务需求动态添加或删除节点,从而轻松扩展硬件资源。 然而,Oracle RAC也存在一定的缺点。首先,...

    Oracle-RAC-深度解释.doc

    但是,在某些情况下,节点之间的心跳机制可能会出现问题,导致每个节点认为其他节点宕机了,从而导致数据灾难。这种情况称为脑裂。为了解决脑裂问题,Oracle RAC 采用投票算法(Quorum Algorithm),通过投票来确定...

    oracle RAC 测试案例.doc

    - **重启单个RAC节点模拟测试**:在RAC环境中,当一个节点意外宕机或需要维护时,服务应该能够无缝地切换到其他节点继续运行。测试此场景是为了确保系统的容错能力,确保业务连续性不受影响。 2. **网络故障模拟...

    oracle 9i RAC sg

    - **常见问题分析**:如网络连接中断、节点宕机等情况的诊断方法。 - **应急措施**:提供故障转移方案,确保业务连续性不受影响。 - **案例分享**:结合实际案例讲解如何快速定位问题根源并采取有效措施。 #### 七...

    关于企业大数据的集群备份技术(ORACLE RAC)的解决方案.pdf

    当某个数据库节点宕机时,无需人工干预,其他节点将继续工作,保证业务连续性。 2. **强扩展性**:RAC允许轻松添加或移除节点以适应系统需求变化,实现动态扩展。 3. **高容量**:随着节点数量的增加,RAC的整体...

    CentOS6.x部署实践 Oracle 11g RAC集群搭建

    - **虚拟IP (VIP)**:RAC集群中每个节点都有自己的VIP,当节点宕机时,VIP会漂移到其他节点,从而保证客户端可以持续访问数据库服务。 - **私有IP**:用于集群内部节点间的心跳同步。私有IP不对外可见,是集群内部...

    Oracle10gRAC安装配置.ppt

    1.主备数据库:包括主数据库和备用数据库,备用数据库可以在主数据库宕机时接管主数据库的工作。 2.负载均衡:包括负载均衡算法、会话保持、负载均衡策略等。 3.故障恢复:包括故障检测、故障恢复、自动恢复等。 五...

    rac集群节点2宕机

    2015-09-12 06:51:03.598: [ CRSEVT][11304]32CAAMonitorHandler :: 0:Could not join /opt/oracle/product/10.2/crs/bin/racgwrap(check) category: 1234, operation: scls_process_join, loc: childcrash, OS ...

    Oracle 10g RAC安装配置.pptx

    Standby 可以提供热备用机制,可以在主数据库宕机时快速切换到备用数据库。 Oracle 10g RAC 安装配置需要注意系统环境检查、存储准备、CRS 安装升级、数据库软件安装升级等几个方面。只有按照正确的顺序和步骤进行...

    Oracle10gRAC安装与配置.pptx

    Standby 机制可以在主数据库宕机时,自动切换到备用数据库,确保数据库的高可用性。Oracle 10g RAC 和 Standby 的结合可以提供高性能、可靠性和高可用性的数据库解决方案。 Oracle 10g RAC 安装 Oracle 10g RAC ...

Global site tag (gtag.js) - Google Analytics