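This RAC VIP failure was caused by a fault on ent3, the network adapter carrying the VIP (an EtherChannel, i.e. an active/backup bonded pair), which made node 1's VIP fail over to node 2.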
$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....b1.inst application ONLINE ONLINE crmdb01
ora....b2.inst application ONLINE ONLINE crmdb02
ora....db2.srv application ONLINE ONLINE crmdb02
ora....srv1.cs application ONLINE ONLINE crmdb02
ora.crmdb.db application ONLINE ONLINE crmdb02
ora....01.lsnr application ONLINE OFFLINE
ora....b01.gsd application ONLINE ONLINE crmdb01
ora....b01.ons application ONLINE ONLINE crmdb01
ora....b01.vip application ONLINE ONLINE crmdb02
ora....02.lsnr application ONLINE ONLINE crmdb02
ora....b02.gsd application ONLINE ONLINE crmdb02
ora....b02.ons application ONLINE ONLINE crmdb02
ora....b02.vip application ONLINE ONLINE crmdb02
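In a longer listing, rows like the listener above are easy to miss. The following is a minimal sketch for filtering them out, assuming the `crs_stat -t` column layout shown above (Name, Type, Target, State, Host); the function name `crs_mismatch` is hypothetical and not part of Oracle Clusterware.

```shell
# Hypothetical helper: print resources whose State differs from their Target.
# Assumes the crs_stat -t layout shown above: Name Type Target State [Host].
crs_mismatch() {
    awk '$2 == "application" && $3 != $4 {
        print $1, "target=" $3, "state=" $4
    }'
}

# Usage on a live cluster (not executed here):
#   crs_stat -t | crs_mismatch
```

Note that this only flags Target/State mismatches; a VIP that is ONLINE on the wrong host, as in the listing above, still has to be checked in the Host column.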
The fix is relatively simple: replace the faulty adapter and restart nodeapps on node 1. The VIP then moves back from node 2 to node 1 automatically.
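The nodeapps restart can be sketched as a small wrapper around `srvctl stop nodeapps -n <node>` and `srvctl start nodeapps -n <node>` (10g syntax). This is an illustration only: the function name `restart_nodeapps` and the `DRY_RUN` variable are assumptions, and depending on the version, resources that depend on nodeapps may have to be stopped first.

```shell
# Sketch only: restart nodeapps on one node with srvctl (10g syntax).
# DRY_RUN defaults to 1, so the commands are printed rather than executed.
restart_nodeapps() {
    node=$1
    for action in stop start; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "srvctl $action nodeapps -n $node"
        else
            srvctl "$action" nodeapps -n "$node" || return 1
        fi
    done
}

# Print the commands for node 1 without running them:
restart_nodeapps crmdb01
```

Setting `DRY_RUN=0` would actually run the commands; printing by default is a deliberate choice, since stopping nodeapps affects client connections.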
This incident is also a chance to dig into what happens behind a RAC VIP failover.
When the fault occurred on node 1, errors were visible at the operating system level:
$ netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 0.11.25.be.50.e9 2364166277 0 1352130944 371 0
en0 1500 3.3.22 3.3.22.1 2364166277 0 1352130944 371 0
en3 1500 link#3 0.11.25.be.4d.41 3591277841 0 1817998840 5 0
en3 1500 130.36.23 130.36.23.8 3591277841 0 1817998840 5 0
lo0 16896 link#1 1335635349 0 1335747477 0 0
lo0 16896 127 127.0.0.1 1335635349 0 1335747477 0 0
lo0 16896 ::1 1335635349 0 1335747477 0 0
$ errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
173C787F 0416124011 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416124011 T H ent1 TRANSMIT FAILURE
173C787F 0416095911 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416095811 T H ent1 TRANSMIT FAILURE
4FC185D1 0416065011 T H ent1 TRANSMIT FAILURE
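The TIMESTAMP column of the `errpt` summary is in MMDDhhmmYY form, which makes it awkward to line up against the CRS logs. A minimal sketch that rewrites each entry with a readable timestamp follows; the function name `errpt_summary` is hypothetical, and the `20` century prefix is an assumption that holds for this incident.

```shell
# Hypothetical helper: turn errpt summary lines into
# "YYYY-MM-DD hh:mm IDENTIFIER RESOURCE_NAME".
# Assumes the MMDDhhmmYY timestamp format and years 2000-2099.
errpt_summary() {
    awk '$1 ~ /^[0-9A-F]+$/ && length($2) == 10 {
        ts = $2
        printf "20%s-%s-%s %s:%s %s %s\n",
            substr(ts, 9, 2), substr(ts, 1, 2), substr(ts, 3, 2),
            substr(ts, 5, 2), substr(ts, 7, 2), $1, $5
    }'
}

# Usage on AIX (not executed here):
#   errpt | errpt_summary
```

For the first ent1 entry above, this yields `2011-04-16 12:40 4FC185D1 ent1`, which can be compared directly with the 12:40:13 entries in the CRS logs.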
The detailed error reports follow:
$ errpt -a -j 4FC185D1|more
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1
Date/Time: Sat Apr 16 12:40:04 BEIST 2011
Sequence Number: 10413
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: H
Type: TEMP
Resource Name: ent1
Resource Class: adapter
Resource Type: 14106802
Location: U5791.001.99B18ND-P1-C06-T1
VPD:
Product Specific.( ).......Gigabit Ethernet-SX PCI-X Adapter
Part Number.................10N8586
FRU Number..................10N8586
EC Level....................D76267
Manufacture ID..............YL1021
Network Address.............001125BE4D41
ROM Level.(alterable).......GOL021
Description
TRANSMIT FAILURE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
FILE NAME
line: 2187 file: goent_tx.c
PCI ETHERNET STATISTICS
0000 25C5 0063 081B 0000 0003 0000 0003 0000 0000 0000 0000 0000 0000 0000 00DA
0000 010C D192 B18E 0001 B2FA DD4E 1CFC 0000 0041 1C93 93A5 0000 0000 0031 20A1
0000 00EE 256D C53E 0002 3042 90A3 0EE5 0000 0000 0000 0000 0000 0001 0001 B321
0000 09DF 0000 0000 0000 0000 0000 01DF 0000 000F 0000 0205 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 BBA3 087C 0200 D400 4120 8000 01A0 0000 0000
0230 0156 0009 F007 0443 C808 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
2222 2222 256D C53E 0000 00C8
SOURCE ADDRESS
0011 25BE 4D41
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1
$ errpt -a -j 173C787F|more
---------------------------------------------------------------------------
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F
Date/Time: Sat Apr 16 12:40:21 BEIST 2011
Sequence Number: 10414
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: S
Type: INFO
Resource Name: topsvcs
Description
Possible malfunction on local adapter
Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Recommended Actions
Verify adapter configuration
Verify network connectivity
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,4983
ERROR ID
6zV5DL.pqFeB/ThN//Ml.1....................
REFERENCE CODE
Adapter interface name
en3
Adapter offset
0
Adapter IP address
130.36.23.8
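Since this is a hardware fault, we will not analyze the OS logs in detail. What we care about is what Oracle did at the moment of failure.
When the fault occurred, racg first detected the VIP failure, ran the VIP check again (racgvip check crmdb01), and recorded the result in ora.crmdb01.vip.log: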
2011-04-16 12:40:13.049: [ RACG][1] [4276526][1][ora.crmdb01.vip]: Invalid parameters, or failed to bring up VIP (host=crmdb01)
2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10.2.0.4/crs
2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: cmd = /opt/oracle/product/10.2.0.4/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10.2.0.4/crs/bin/racgvip check crmdb01
2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: rc = 1, time = 4.405s
2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: end for resource = ora.crmdb01.vip, action = check, status = 1, time = 4.572s
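Once the check finished and the VIP was judged abnormal, the crs process carried out the VIP failover. The log shows that after crs found the VIP had gone offline abnormally (OFFLINE unexpectedly), it stopped the VIP, the dependent service, and the listener on crmdb01, then relocated the service ora.crmdb.crmsrv1.crmdb2.srv to crmdb02, i.e. node 2.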
2011-04-16 12:40:13.058: [ CRSAPP][11051]32CheckResource error for ora.crmdb01.vip error code = 1
2011-04-16 12:40:13.071: [ CRSRES][11051]32In stateChanged, ora.crmdb01.vip target is ONLINE
2011-04-16 12:40:13.072: [ CRSRES][11051]32ora.crmdb01.vip on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.072: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.086: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.vip` on member `crmdb01`
2011-04-16 12:40:13.487: [ CRSRES][11312]32In stateChanged, ora.crmdb.crmsrv1.crmdb2.srv target is ONLINE
2011-04-16 12:40:13.487: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.488: [ CRSRES][11312]32StopResource: setting CLI values
2011-04-16 12:40:13.520: [ CRSRES][11312]32Attempting to stop `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01`
2011-04-16 12:40:13.636: [ CRSRES][11051]32Stop of `ora.crmdb01.vip` on member `crmdb01` succeeded.
2011-04-16 12:40:13.636: [ CRSRES][11051]32ora.crmdb01.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:13.650: [ CRSRES][11051]32ora.crmdb01.vip failed on crmdb01 relocating.
2011-04-16 12:40:13.770: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.786: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.LISTENER_CRMDB01.lsnr` on member `crmdb01`
2011-04-16 12:40:14.093: [ CRSRES][11312]32Stop of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01` succeeded.
2011-04-16 12:40:14.094: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:14.105: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv failed on crmdb01 relocating.
2011-04-16 12:40:14.150: [ CRSRES][11312]32Attempting to start `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02`
2011-04-16 12:40:14.442: [ CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.
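To see the order of operations at a glance, the stop and start attempts can be pulled out of the crsd log. The following is a minimal sketch, assuming the log line format shown above; the function name `crs_actions` is hypothetical.

```shell
# Hypothetical helper: list "Attempting to stop/start" events from a crsd log
# as "time action resource node". Assumes the line format shown above.
crs_actions() {
    awk '/Attempting to (stop|start) / {
        time = $2
        sub(/:$/, "", time)                  # drop the trailing colon
        for (i = 1; i <= NF; i++) {
            if ($i ~ /Attempting$/) {        # word may be fused to the thread id
                verb = $(i + 2)
                res = $(i + 3)
                node = $NF
            }
        }
        gsub(/`/, "", res)
        gsub(/`/, "", node)
        print time, verb, res, node
    }'
}

# Usage (log path not shown in the original; supply your own):
#   crs_actions < crsd.log
```

Applied to the excerpt above, the output lists the stop of ora.crmdb01.vip first, then the stop of the service, the stop of the listener, and finally the start of the service on crmdb02.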
At the same time, the crs log on node 2 showed:
2011-04-16 12:40:14.148: [ CRSRES][11617]32startRunnable: setting CLI values
2011-04-16 12:40:24.488: [ CRSRES][12145]32CRS-1002: Resource 'ora.crmdb.crmsrv1.cs' is already running on member 'crmdb02'
Note that a VIP failure can even cause every resource that depends on the VIP to be stopped:
If the VIP fails for any reason and cannot be restarted, CRS will bring down all dependent resources, including the Listener, ASM instance and database instance. CRS will attempt to bring these resources down gracefully - hence, a shutdown immediate will be issued, and will be seen in the alert log of the ASM instance - no errors will be evident in the alert log for the ASM instance.
The following case is from Metalink (ID 277274.1); this failure occurs frequently on 10.1:
`ora.rmsclnxclu1.vip` on `rmsclnxclu1` went OFFLINE unexpectedly
2004-06-21 21:21:05.562: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
RTD #0: Action Script /home/oracle/product/crs/bin/racgwrap(stop) timed out for ora.rmsclnxclu1.vip! (timeout=60)
2004-06-21 21:22:16.472: [RTI:884782] StopResource error for ora.rmsclnxclu1.vip error code = 1
2004-06-21 21:22:18.611: `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` has experienced an unrecoverable failure.
2004-06-21 21:22:18.611: Human intervention required to resume its availability.
2004-06-21 21:22:18.790: [RUNNABLELISTENER:884782] Resource failed into UNKNOWN, killing dependents
`ora.rmsclnxclu1.vip` experienced a failure on `rmsclnxclu1`. Stopping dependent resources.
2004-06-21 21:22:20.525: Attempting to stop `ora.gofod.gofod1.inst` on member `rmsclnxclu1`
2004-06-21 21:25:38.531: Stop of `ora.gofod.gofod1.inst` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:38.611: Attempting to stop `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1`
2004-06-21 21:25:38.983: Stop of `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:39.041: Attempting to stop `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1`
2004-06-21 21:25:46.669: Stop of `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:46.728: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
2004-06-21 21:25:55.547: Stop of `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` succeeded.
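If you hit the failure above, or the VIP frequently goes offline on its own, the following approaches can help:
1. Enable VIP tracing, so that more detailed logs are available if the VIP fails again.
Enable VIP tracing: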
[root@node1 admin]# crsctl debug log res ora.node1.vip:1
Set Resource Debug Module: ora.node1.vip Level: 1
Disable VIP tracing:
[root@node1 admin]# crsctl debug log res ora.node1.vip:0
Set Resource Debug Module: ora.node1.vip Level: 0
In 11g R2 the syntax for enabling tracing changes to:
#crsctl set log res "ora.rmntops1.vip.com:1"
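2. Change the VIP check interval and the script timeout: raise the check interval from the default 30 seconds to 120 seconds, and the script timeout from 60 seconds to 120 seconds.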
1. Create the .cap file for each vip resource (on each node):
./crs_stat -p ora.rmsclnxclu1.vip > /tmp/ora.rmsclnxclu1.vip.cap
2. Then, update the .cap file using the following syntax and values:
./crs_profile -update ora.rmsclnxclu1.vip -dir /tmp -o ci=120,st=120
(Where ci = the CHECK_INTERVAL and st = the SCRIPT_TIMEOUT value.)
3. Finally, re-register it using the '-u' option:
./crs_register ora.rmsclnxclu1.vip -dir /tmp -u
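After re-registering, it is worth confirming that the new values took effect. A minimal sketch follows, assuming the KEY=value profile format that `crs_stat -p` prints; the function name `cap_timings` is hypothetical.

```shell
# Hypothetical helper: show only the two timing attributes from a resource
# profile read on stdin. Assumes KEY=value lines as printed by crs_stat -p.
cap_timings() {
    awk -F= '$1 == "CHECK_INTERVAL" || $1 == "SCRIPT_TIMEOUT" {
        print $1 "=" $2
    }'
}

# Usage on a live cluster (not executed here):
#   ./crs_stat -p ora.rmsclnxclu1.vip | cap_timings
```

If the update worked, both attributes should report 120.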
3. On 10.1, you can remove the VIP dependency from the ASM resource:
ASM resource name is in the form of ora.<nodename>.<ASM instance name>.asm.
VIP resource name is in the form of ora.<nodename>.vip
- crs_stat -p <ASM resource name> > /tmp/<ASM resource name>.cap
- Edit /tmp/<ASM resource name>.cap to remove VIP resource name from the REQUIRED_RESOURCES attribute.
- crs_register -u <ASM resource name> -dir /tmp
- Use "crs_stat -p <ASM resource name>" to verify that the REQUIRED_RESOURCES attribute is updated.
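The manual edit of the .cap file can also be scripted. The following is a minimal sketch, assuming REQUIRED_RESOURCES holds a space-separated list of resource names on a single line; the function name `drop_required` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: remove one resource name from the REQUIRED_RESOURCES
# line of a profile read on stdin; all other lines pass through unchanged.
drop_required() {
    awk -v res="$1" '
        /^REQUIRED_RESOURCES=/ {
            n = split(substr($0, index($0, "=") + 1), parts, " ")
            out = ""
            sep = ""
            for (i = 1; i <= n; i++) {
                if (parts[i] != res) {
                    out = out sep parts[i]
                    sep = " "
                }
            }
            print "REQUIRED_RESOURCES=" out
            next
        }
        { print }'
}

# Usage (writes a new file rather than editing in place):
#   drop_required ora.node1.vip < /tmp/asm.cap > /tmp/asm.cap.new
```

Writing to a new file keeps the original profile available for comparison before running crs_register.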