
[Incident] RAC ASM Disk Path Failure Causes the OCR and Voting Disk Disk Group to Be Force-Dismounted


The financial trading system of Chuangjin Hexin Fund Management Co., Ltd. runs on a two-node Oracle RAC database: Oracle Database 11.2.0.4 for AMD64 on Red Hat Linux 5.7 x86-64, hosted on physical Dell servers with 32 GB of memory. After the close of the previous trading day, the hosting datacenter's operations team had the storage vendor replace a memory module in a storage controller, without asking for the Oracle RAC database and the Clusterware stack to be stopped first. As a result, the pre-market inspection on the next trading day found that the database could no longer be connected to normally, although existing application-layer connections were unaffected and could still access the database.

 

Checking the status of the relevant cluster resources, we found that the crsctl stat res -t query itself failed: the ora.crsd resource was down on both node 1 and node 2. The lower-stack view from crsctl stat res -t -init on node 1 looked like this:

 

 

+ASM1@prod1:/u01/app/oragrid/11.2.0.2/bin>./crsctl stat res -t -init
-------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       prod1            Started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       prod1
ora.crf
      1        ONLINE  ONLINE       prod1
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  ONLINE       prod1
ora.cssdmonitor
      1        ONLINE  ONLINE       prod1
ora.ctssd
      1        ONLINE  ONLINE       prod1            OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       prod1
ora.drivers.acfs
      1        ONLINE  ONLINE       prod1
ora.evmd
      1        ONLINE  INTERMEDIATE prod1
ora.gipcd
      1        ONLINE  ONLINE       prod1
ora.gpnpd
      1        ONLINE  ONLINE       prod1
ora.mdnsd
      1        ONLINE  ONLINE       prod1
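One detail worth spelling out: plain crsctl stat res -t talks to the crsd daemon, so it fails as long as ora.crsd is OFFLINE, while the -init flag lists the lower-stack resources managed by ohasd, which is why the listing above still works. A minimal illustration (the CRS-4535 error is the same one we hit in the next step):

+ASM1@prod1:/u01/app/oragrid/11.2.0.2/bin>./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
+ASM1@prod1:/u01/app/oragrid/11.2.0.2/bin>./crsctl stat res -t -init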

 

We then checked the status of the CRS daemons with crsctl check crs, which returned the following error: the Cluster Ready Services daemon could not be contacted.

 
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
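
When crsctl check crs fails like this, the per-component checks help localize which layer is down. A short sketch with standard 11.2 commands; given that the -init listing above shows ora.cssd ONLINE and ora.evmd INTERMEDIATE, only the CRS check should fail outright:

./crsctl check cluster
./crsctl check css
./crsctl check evm
./crsctl check crs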

 

Checking the Voting Disk and the Oracle Cluster Registry (OCR) with ocrcheck produced the same error; no validation information for the OCR or the Voting Disk could be obtained:

 

CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Check failed, or completed with errors.

 

This tells us the cluster cannot validate the disk group holding the Voting Disk and the Oracle Cluster Registry: when that disk group has failed or hit internal errors, ocrcheck has no way to obtain validation information. Combined with the earlier finding that the ora.crsd resource (the Cluster Ready Services daemon) was not ONLINE, it was very likely that the cluster registry was no longer accessible. The natural next step was to look at the state of the underlying ASM disk groups, so we queried v$asm_diskgroup:

 

SQL> select name, state from v$asm_diskgroup;
NAME              STATE
------------------------------ -----------
DATADG             MOUNTED
CRSDG              DISMOUNTED 
ASMCMD> ls
ASMCMD> lsdg
+DATADG

 

From the output above, CRSDG, the disk group holding the Voting Disk and the cluster registry, was already DISMOUNTED at the RAC ASM instance level (the same information is available from ASMCMD's ls or lsdg). At the operating-system level, however, the disks behind the corresponding raw devices had not disappeared, which pointed to either a failure in our multipath configuration or the ASM devices being temporarily unrecognizable at the ASM instance level. The ora.crsd log gives us more detail.
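
Before turning to crsd.log, the OS-level view can be confirmed directly. A minimal sketch of those checks, run as root, assuming the multipath aliases ocrvote1/2/3 that appear later in /var/log/messages (the grid:asmadmin ownership below is an assumption about this environment):

# List the multipath map and its path states; failed paths point to a
# storage-path problem rather than an ASM-internal one:
multipath -ll ocrvote1

# Confirm the device nodes still exist and keep their ASM ownership
# (assumed here to be grid:asmadmin):
ls -l /dev/mapper/ocrvote*

# Read the ASM disk header with kfed (shipped in the Grid home); a
# header of type KFBTYP_DISKHEAD means the on-disk content is intact:
$ORACLE_HOME/bin/kfed read /dev/mapper/ocrvote1 | grep -E 'kfbh.type|kfdhdb.dskname'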

 

crsd.log records the failure time and the exact errors as they happen, and its error timestamps line up exactly with the window in which the Shenzhen Securities Communication Co. datacenter engineers performed their maintenance: the hosting datacenter worked from 16:10 to 17:30, and crsd.log records device access failures at 16:13:15 (the OS log below shows path failures starting at 16:12:57), detected at the RAC ASM layer. The relevant entries are:

 

2016-04-30 16:13:15.564: [UiServer][2968028928]{1:53576:13269} Container [ Name: UI_STOP
	API_HDR_VER: 
	TextMessage[2]
	ASYNC_TAG: 
	TextMessage[1]
	CLIENT: 
	TextMessage[]
	CLIENT_NAME: 
	TextMessage[/u01/grid/11.2.0/bin/oracle]
	CLIENT_PID: 
	TextMessage[3337]
	CLIENT_PRIMARY_GROUP: 
	TextMessage[oinstall]
	EVENT_TAG: 
	TextMessage[1]
	FILTER: 
	TextMessage[(((NAME==ora.CRSDG.dg)&&(LAST_SERVER==rac1))&&
(STATE!=OFFLINE))USR_ORA_OPI=true]
	FILTER_TAG: 
	TextMessage[1]
	FORCE_TAG: 
	TextMessage[1]
	LOCALE: 
	TextMessage[AMERICAN_AMERICA.AL32UTF8]
	NO_WAIT_TAG: 
	TextMessage[1]
	QUEUE_TAG: 
	TextMessage[1]
]
2016-04-30 16:13:15.564: [UiServer][2968028928]{1:53576:13269} Sending message to PE. ctx= 0x7f6a1800c1c0, Client PID: 3337
2016-04-30 16:13:15.564: [   CRSPE][2970130176]{1:53576:13269} Cmd : 0x7f6a2412f480 : flags: EVENT_TAG | FORCE_TAG | QUEUE_TAG
2016-04-30 16:13:15.564: [  CRSPE][2970130176]{1:53576:13269} Processing PE command id=78811. Description: [Stop Resource : 0x7f6a2412f480]
2016-04-30 16:13:15.564: [   CRSPE][2970130176]{1:53576:13269} Expression Filter : (((NAME == ora.CRSDG.dg) AND (LAST_SERVER == rac1)) AND (STATE != OFFLINE))
2016-04-30 16:13:15.565: [   CRSPE][2970130176]{1:53576:13269} Expression Filter : (((NAME == ora.CRSDG.dg) AND (LAST_SERVER == rac1)) AND (STATE != OFFLINE))
2016-04-30 16:13:15.565: [   CRSPE][2970130176]{1:53576:13269} Attribute overrides for the command: USR_ORA_OPI = true;
2016-04-30 16:13:15.565: [   CRSPE][2970130176]{1:53576:13269} Filtering duplicate ops: server [] state [OFFLINE]
2016-04-30 16:13:15.565: [   CRSPE][2970130176]{1:53576:13269} Op 0x7f6a2410d8b0 has 5 WOs
2016-04-30 16:13:15.566: [   CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new target state: [OFFLINE] old value: [ONLINE]
2016-04-30 16:13:15.566: [  CRSOCR][2978535168]{1:53576:13269} Multi Write Batch processing...
2016-04-30 16:13:15.566: [   CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new internal state: [STOPPING] old value: [STABLE]
2016-04-30 16:13:15.566: [   CRSPE][2970130176]{1:53576:13269} Sending message to agfw: id = 1774249
2016-04-30 16:13:15.566: [    AGFW][2980636416]{1:53576:13269} Agfw Proxy Server received the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.566: [   CRSPE][2970130176]{1:53576:13269} CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1'

 

From this log we can see that the moment the datacenter staff pulled the storage controller, the Oracle layer detected it in crsd.log, and Oracle Clusterware judged that the CRSDG disk group had to be taken offline to avoid unnecessary corruption: it issued a Container [ Name: UI_STOP ] operation to stop the dependent resources and dismount the disk group. The log excerpt below supports this conclusion; the TextMessage entries sent out by the crsd daemon are particularly worth noting:

 

TextMessage[CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1']
TextMessage[CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded]
2016-04-30 16:13:15.566: [    AGFW][2980636416]{1:53576:13269} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249 to the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.566: [UiServer][2968028928]{1:53576:13269} Container [ Name: ORDER
	MESSAGE: 
	TextMessage[CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1']
	MSGTYPE: 
	TextMessage[3]
	OBJID: 
	TextMessage[ora.CRSDG.dg rac1 1]
	WAIT: 
	TextMessage[0]
]
2016-04-30 16:13:15.566: [ COMMCRS][2968028928]clscsendx: (0x7f6a5c0e9eb0) Connection not active

2016-04-30 16:13:15.566: [UiServer][2968028928]{1:53576:13269} CS(0x7f6a1c009ec0)Error sending msg over socket.6
2016-04-30 16:13:15.567: [UiServer][2968028928]{1:53576:13269} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
(File: clsMessageStream.cpp, line: 275

2016-04-30 16:13:15.568: [    AGFW][2980636416]{1:53576:13269} Received the reply to the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774250 from the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.569: [    AGFW][2980636416]{1:53576:13269} Agfw Proxy Server sending the reply to PE for message:RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.569: [   CRSPE][2970130176]{1:53576:13269} Received reply to action [Stop] message ID: 1774249
2016-04-30 16:13:15.587: [    AGFW][2980636416]{1:53576:13269} Received the reply to the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774250 from the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.587: [    AGFW][2980636416]{1:53576:13269} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.587: [   CRSPE][2970130176]{1:53576:13269} Received reply to action [Stop] message ID: 1774249
2016-04-30 16:13:15.587: [   CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new internal state: [STABLE] old value: [STOPPING]
2016-04-30 16:13:15.588: [   CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new external state [OFFLINE] old value: [ONLINE] label = [] 
2016-04-30 16:13:15.588: [   CRSPE][2970130176]{1:53576:13269} CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded

2016-04-30 16:13:15.588: [  CRSRPT][2968028928]{1:53576:13269} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.CRSDG.dg
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} Container [ Name: ORDER
	MESSAGE: 
	TextMessage[CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded]
	MSGTYPE: 
	TextMessage[3]
	OBJID: 
	TextMessage[ora.CRSDG.dg rac1 1]
	WAIT: 
	TextMessage[0]
]
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} CS(0x7f6a1c009ec0)No connection to client.6
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
(File: clsMessageStream.cpp, line: 275

 

The record confirming completion of the STOP operation on ora.CRSDG.dg is shown below; the same Container construct appears again, this time with the UI_DATA parameter:

 

2016-04-30 16:13:15.590: [UiServer][2968028928]{1:53576:13269} Container [ Name: UI_DATA
	ora.CRSDG.dg rac1 1: 
	TextMessage[0]
]

 

We can also look at crsdOUT.log to see whether the Clusterware's subsequent behavior was abnormal. The anomaly is very easy to spot there, because this log records only the directory change made to launch the crsd daemon and the final result of that attempt, with no detailed trace information, so it is short and unambiguous.

 

2016-04-30 18:36:52  
CRSD REBOOT
CRSD exiting: Could not init OCR, code: 26
2016-04-30 18:36:54  
Changing directory to /u01/grid/11.2.0/log/rac1/crsd
2016-04-30 18:36:54  
CRSD REBOOT
CRSD exiting: Could not init OCR, code: 26
2016-04-30 18:36:56  
Changing directory to /u01/grid/11.2.0/log/rac1/crsd
2016-04-30 18:36:56  
CRSD REBOOT

 

As the log shows, the CRSD daemon kept trying to initialize the OCR and kept failing, because, as the earlier logs established, the ASM disk group holding the OCR was already dismounted. The operating-system log /var/log/messages, excerpted below, confirms our hypothesis of a failure of the multipath-bound raw devices:

 

Apr 30 16:12:57 rac1 kernel: rport-8:0-1: blocked FC remote port time out: removing target and saving binding
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:224.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:208.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] killing request
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:176.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] Unhandled error code
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] CDB: Read(10): 28 00 00 03 50 60 00 00 20 00
Apr 30 16:12:57 rac1 kernel: rport-9:0-0: blocked FC remote port time out: removing target and saving binding
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 3 1 8:112 1 8:192 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:96.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:144.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: alua: rtpg failed with 10000
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 multipathd: 8:144: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote3: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: 8:224: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote3: remaining active paths: 2
Apr 30 16:12:57 rac1 multipathd: 8:208: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote2: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: sdl: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: alua: rtpg failed with 10000
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 multipathd: data1: load table [0 482402304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:96 1 round-robin 0 2 1 8:16 1 65:0 1]
Apr 30 16:12:57 rac1 multipathd: sdl: path removed from map data1
Apr 30 16:12:57 rac1 multipathd: 8:112: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote1: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: 8:192: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote1: remaining active paths: 2
Apr 30 16:12:57 rac1 multipathd: sdm: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: alua: Detached
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] killing request
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] Unhandled error code
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:128.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] CDB: Write(10): 2a 00 00 08 00 11 00 00 01 00
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:96.
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:2: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:2: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:3: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:3: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 2 1 8:112 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 multipathd: sdm: path removed from map ocrvote1
Apr 30 16:12:57 rac1 multipathd: sdg: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:1: alua: Detached
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 multipathd: data1: load table [0 482402304 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:16 1 65:0 1]
Apr 30 16:12:57 rac1 multipathd: sdg: path removed from map data1
Apr 30 16:12:57 rac1 multipathd: sdn: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:0: alua: Detached
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr 30 16:12:57 rac1 multipathd: ocrvote2: failed in domap for removal of path sdn
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdh: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:1: alua: Detached
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 multipathd: sdh: path removed from map ocrvote1
Apr 30 16:12:57 rac1 multipathd: sdo: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:3: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:3: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 multipathd: ocrvote3: failed in domap for removal of path sdo
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdp: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:4: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 multipathd: data2: failed in domap for removal of path sdp
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdi: remove path (uevent)
Apr 30 16:12:57 rac1 multipathd: ocrvote2: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:48 1 65:32 1]
Apr 30 16:12:57 rac1 multipathd: sdi: path removed from map ocrvote2
Apr 30 16:12:57 rac1 multipathd: sdj: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:4: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:2: alua: Detached
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:2: alua: Detached

A question arises here: why did ora.cssd stay up when ora.crsd went down? (crsctl stat res -t -init confirms ora.cssd never failed, the database instances kept running, and the node was not evicted.) The reason is that the disks behind the OCR and Voting Disks were inaccessible only briefly: the cssd process reads the three ASM disks backing the Voting Disks directly and does not require the disk group containing them to be MOUNTED, and the Clusterware disk heartbeat timeout (disktimeout) defaults to 200 seconds, so cssd rode out the outage without incident.
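
Both halves of that explanation can be verified with standard commands: cssd's view of the voting disks is independent of the disk group state, and the heartbeat timeouts are queryable. A quick sketch:

# The voting disks stay visible to cssd even while ora.CRSDG.dg is OFFLINE:
./crsctl query css votedisk

# The timeouts that bound how long cssd tolerates the outage
# (disktimeout defaults to 200s, misscount to 30s on Linux):
./crsctl get css disktimeout
./crsctl get css misscount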

 

If the raw devices bound via multipath are not recognized and available at the operating-system level, the first step is to reboot the database server, or to restore the device bindings by some other means (if you use raw-device bindings for your ASM device names); in most cases this failure is resolved by rebooting the database host or restarting the HAS stack. There are, however, plenty of cases where the ASM disk headers of CRSDG are corrupted, or the OCR and Voting Disk contents are lost; CRSDG then cannot simply be remounted, and other recovery approaches are required.
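
For the simple case, where the paths come back and the disk group merely needs remounting, a recovery sketch along the following lines is typical (run as the Grid owner; re-verify resource state after each step):

# 1. Re-mount the disk group in the ASM instance once the paths return:
sqlplus / as sysasm
SQL> ALTER DISKGROUP CRSDG MOUNT;

# 2. Start the crsd resource that was looping on "Could not init OCR":
./crsctl start res ora.crsd -init

# 3. Verify the OCR and the full stack:
./ocrcheck
./crsctl stat res -t

# Alternatively, bounce the whole HAS stack on the affected node:
#   crsctl stop has -f
#   crsctl start has

# If the CRSDG disk headers are damaged or the OCR contents are lost,
# check for the automatic OCR backups before attempting a restore:
./ocrconfig -showbackup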

 

 
