Oracle RAC Node Failure: File table overflow

itspace


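A customer's database suffered a single-node failure at about 9:00 a.m. on April 26, 2010. On the back end, the database instance on node 1 (hisdb01) terminated abnormally, which in turn caused the node 1 host to reboot. On the front end, some business functions became unavailable. Because no host performance tracing script had been deployed, we could only infer from the logs collected on site that the host had run short of a resource (for example, file handles that were not released), causing the Oracle instance to terminate abnormally. At about 9:00 a.m. on June 7, 2010, a single-node failure occurred again.
Back-end log analysis
Review the logs from before and after the failure.
1. Operating system log: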

Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to resolve local hostname hisdb01 to determine the domain name
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun 7 09:08:19 hisdb01 cmcld[7603]: Sending file $SGRUN/frdump.cmcld.8 (167257 bytes) to file assistant daemon.
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun 7 09:08:19 hisdb01 above message repeats 3 times
Jun 7 09:08:19 hisdb01 cmfileassistd[2894]: Updated file /var/adm/cmcluster/frdump.cmcld.8 (length = 167257).
Jun 7 09:09:00 hisdb01 inetd[1018]: hacl-cfg/tcp: accept: File table overflow
Jun 7 09:09:19 hisdb01 cmcld[7603]: Service cmfileassistd terminated due to an exit(0).
Jun 7 09:12:16 hisdb01 syslog: Unable to open the /etc/utmpx file, to sync the records from file->/usr/sbin/utmpd
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 13576 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 syslogd: utmp database: Bad file number
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 10 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 17 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
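
To see which processes were hit hardest, the syslog excerpt above can be tallied per source. The following is a minimal Python sketch, not part of the original investigation. It assumes the syslog layout shown above (timestamp, host, source, message) and treats an "above message repeats N times" line as N more occurrences of the preceding message. The function name `count_file_table_errors` and the `SAMPLE` lines are illustrative.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "Jun 7 09:08:18 host rest-of-line"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+(.*)$")
REPEAT_RE = re.compile(r"^above message repeats (\d+) times$")
SOURCE_RE = re.compile(r"^([\w-]+)(?:\[\d+\])?:\s")


def count_file_table_errors(lines):
    """Count file-table-full messages per source process."""
    counts = Counter()
    last_source = None  # source of the previous line, if it was a matching error
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        rest = m.group(3)
        rep = REPEAT_RE.match(rest)
        if rep:
            # Credit the repeats to the message immediately before this line.
            if last_source is not None:
                counts[last_source] += int(rep.group(1))
            continue
        src = SOURCE_RE.match(rest)
        is_error = "File table overflow" in rest or "file: table is full" in rest
        if src and is_error:
            last_source = src.group(1)
            counts[last_source] += 1
        else:
            last_source = None
    return counts


SAMPLE = [
    "Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow",
    "Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow",
    "Jun 7 09:12:17 hisdb01 vmunix: file: table is full",
    "Jun 7 09:12:17 hisdb01 above message repeats 13576 times",
]

if __name__ == "__main__":
    for source, n in count_file_table_errors(SAMPLE).most_common():
        print(source, n)
```

On the four sample lines, the kernel (`vmunix`) dominates because of the single repeat line, which matches the impression given by the full excerpt.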

2. CRS background log:

2010-06-06 21:15:09.225: [ CRSEVT][167223] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.orcl.orcl1.inst
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:09.225: [ CRSAPP][167223] CheckResource error for ora.orcl.orcl1.inst error code = -1
2010-06-06 21:15:19.211: [ CRSEVT][167224] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/crs/bin/racgwrap(check) for ora.hisdb01.ons
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:19.211: [ CRSAPP][167224] CheckResource error for ora.hisdb01.ons error code = -1
2010-06-06 21:16:18.020: [ CRSEVT][167225] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.hisdb01.ASM1.asm
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:16:18.021: [ CRSAPP][167225] CheckResource error for ora.hisdb01.ASM1.asm error code = -1

3. Log of instance orcl1:

Sun Jun 6 21:08:42 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_2915.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27544: Message 27544 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [socket] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [sskgxpcre1]
…
Sun Jun 6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun 6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3
Sun Jun 6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun 6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3

4. Background log of instance asm1:

Sun Jun 6 21:14:26 2010
Errors in file /oracle/app/product/admin/+ASM/udump/+asm1_ora_3254.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27504: Message 27504 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [ioctl] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [skgxpvaddr1]

5. nfile usage before the failure:

root@hisdb01:/sbin/init.d # kcusage nfile
Tunable Usage / Setting
=============================================
nfile 51795 / 65536

6. imon_orcl1.log:

2010-06-17 17:38:17.168: [ RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

2010-06-17 17:39:17.178: [ RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
"/oracle/app/product/db10g/log/hisdb01/racg/imon_orcl.log" 158031 lines, 9229057 characters
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
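
The numeric OS codes in these logs (23 in the CRS, orcl1 and asm1 logs, 12 in the imon log) can be turned into symbolic errno names. The sketch below uses Python's standard `errno` module. It rests on one assumption: the machine running the script uses the same numbering as the HP-UX host, which holds on Linux and most Unix-like systems for these two codes. The message text differs between platforms, so only the symbolic name is looked up. `decode_os_error` is an illustrative helper name.

```python
import errno


def decode_os_error(code):
    """Return the symbolic errno name for a numeric OS error code."""
    return errno.errorcode.get(code, "UNKNOWN")


if __name__ == "__main__":
    # 23: "File table overflow" in the Oracle and CRS logs
    # 12: "Not enough space" in the imon log
    for code in (23, 12):
        print(code, decode_os_error(code))
```

ENFILE means the system-wide open file table is full, which is a different condition from a single process running out of descriptors (EMFILE). That distinction is why the investigation below focuses on the kernel tunable nfile.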

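The logs above (the key entries were highlighted in red in the original post) suggest that the failure was most likely caused by Oracle hitting an operating system resource limit. The next step is to look at OS resource utilization before and after the failure.
1. nfile usage: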

root@hisdb02:/ # kcusage nfile
Tunable Usage / Setting
=============================================
nfile 12089 / 65536
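
To compare the two `kcusage nfile` samples quoted in this post, the usage can be expressed as a percentage of the configured limit. This is a minimal sketch that only parses text in the layout shown above. It does not call `kcusage` itself, and the function names are illustrative.

```python
import re


def parse_kcusage(text):
    """Return {tunable: (usage, setting)} from kcusage-style output."""
    result = {}
    for line in text.splitlines():
        # Data rows look like "nfile 51795 / 65536"; header rows do not match.
        m = re.match(r"^\s*(\w+)\s+(\d+)\s*/\s*(\d+)\s*$", line)
        if m:
            result[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return result


def utilization(usage, setting):
    """Usage as a percentage of the configured limit."""
    return 100.0 * usage / setting


HISDB01 = """Tunable Usage / Setting
=============================================
nfile 51795 / 65536"""

HISDB02 = """Tunable Usage / Setting
=============================================
nfile 12089 / 65536"""

if __name__ == "__main__":
    for host, text in (("hisdb01", HISDB01), ("hisdb02", HISDB02)):
        usage, setting = parse_kcusage(text)["nfile"]
        print(f"{host}: {utilization(usage, setting):.1f}% of nfile in use")
```

With the figures from the post, hisdb01 was at roughly 79% of the limit before the failure, while hisdb02 was at roughly 18%.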

2. Host memory and CPU:

zzz ***Sun Jun 6 21:17:20 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs us sy id
    1     0     0 2311458 1996119 172   18     0    0     0    0     0   2620 21313   834   0 1 99
    1     0     0 2311458 1996103 191   21     0    0     0    0     0   2408 16170   709   0 1 99
    1     0     0 2311458 1995210 166   18     0    0     0    0     0   2403 14823   700   1 0 99
zzz ***Sun Jun 6 21:17:30 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs us sy id
    1     0     0 2285994 1996297 172   18     0    0     0    0     0   2620 21313   834   0 1 99
    1     0     0 2285994 1996297 171   20     0    0     0    0     0   2426 11112   710   1 1 98
    1     0     0 2285994 1995404 150   17     0    0     0    0     0   2398 10711   694   0 1 99
zzz ***Sun Jun 6 21:17:40 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs us sy id
    2     0     0 2196419 1996297 172   18     0    0     0    0     0   2620 21313   834   0 1 99
    2     0     0 2196419 1995404 170   19     0    0     0    0     0   2372 10075   698   0 1 99
    2     0     0 2196419 1995386 149   17     0    0     0    0     0   2380 10401   715   0 0 100
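
The idle CPU figures can be pulled out of vmstat snapshots like the ones above with a few lines of Python. This sketch assumes the 18-column layout shown here, with `id` as the last column, and skips the `zzz` timestamp and header lines. `idle_percentages` is an illustrative name, and `SAMPLE` copies the first snapshot.

```python
def idle_percentages(vmstat_text):
    """Return the idle-CPU column (last field) of each vmstat data row."""
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows have 18 purely numeric fields in this layout.
        if len(fields) == 18 and all(f.isdigit() for f in fields):
            values.append(int(fields[-1]))
    return values


SAMPLE = """zzz ***Sun Jun 6 21:17:20 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs us sy id
    1     0     0 2311458 1996119 172   18     0    0     0    0     0   2620 21313   834   0 1 99
    1     0     0 2311458 1996103 191   21     0    0     0    0     0   2408 16170   709   0 1 99
    1     0     0 2311458 1995210 166   18     0    0     0    0     0   2403 14823   700   1 0 99
"""

if __name__ == "__main__":
    idle = idle_percentages(SAMPLE)
    print("samples:", len(idle), "min idle %:", min(idle))
```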

3. Disk I/O:

zzz ***Sun Jun 6 21:06:37 EAT 2010

device    bps     sps    msps

c1t0d0      0     0.0     1.0
c6t0d1      0     0.0     1.0
c6t0d2      0     0.0     1.0
c6t0d3      0     0.0     1.0
c6t0d4      0     0.0     1.0
c6t0d5      0     0.0     1.0
c8t0d1      0     0.0     1.0
c8t0d2      0     0.0     1.0
c8t0d3      0     0.0     1.0
c8t0d4      0     0.0     1.0
c8t0d5      0     0.0     1.0
c10t0d1      0     0.0     1.0
c10t0d2      0     0.0     1.0
c10t0d3      0     0.0     1.0
c10t0d4      0     0.0     1.0
c10t0d5      0     0.0     1.0
c12t0d1      0     0.0     1.0
c12t0d2      0     0.0     1.0
c12t0d3      0     0.0     1.0
c12t0d4      0     0.0     1.0
c12t0d5      0     0.0     1.0
c6t0d6      0     0.0     1.0
c6t0d7      0     0.0     1.0
c6t1d0      0     0.0     1.0
c6t1d1      0     0.0     1.0
c6t1d2      0     0.0     1.0
c6t1d3      0     0.0     1.0
c8t0d6      0     0.0     1.0
c8t0d7      0     0.0     1.0
c8t1d0      0     0.0     1.0
c8t1d1      0     0.0     1.0
c8t1d2      0     0.0     1.0
c8t1d3      0     0.0     1.0
c10t0d6      0     0.0     1.0
c10t0d7      0     0.0     1.0
c10t1d0      0     0.0     1.0
c10t1d1      0     0.0     1.0
c10t1d2      0     0.0     1.0
c10t1d3      0     0.0     1.0

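These three items give a rough picture of host resource usage before and after the failure, and they clearly show that the host was largely idle when the failure occurred.
When resource contention (such as being unable to obtain a file handle) occurs on a host with plenty of free resources, an Oracle bug is a likely cause. A search of Oracle's official documentation turned up an unpublished bug (unpublished Bug 6931689) whose symptoms closely match this failure; see Metalink Doc 739557.1.
The bug mainly affects the following platforms: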

HP-UX PA-RISC (64-bit)
HP-UX Itanium
HP IA64 HPUNIX
HP 9000 Series HP-UX (64-bit)

Affected database versions: 10.2.0.3 to 11.1.0.6

- 10.2.0.3, 10.2.0.3 + CRS Bundle Patch #2 or CRS Bundle Patch #3
- 10.2.0.4
- 11.1.0.6

Solution:
On top of the current version, apply one of the following patches:

- CRS 10.2.0.4 Bundle Patch #2 (Patch 7493592) or above. See Note 405820.1
- Latest 10.2.0.4 CRS PSU Patch as per Note 756671.1
The fix has to be applied to both CRS and RAC Database home to fix the problem.
The BUG is fixed in 11.1.0.7 and will be fixed in 10.2.0.5.

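Recommendations:
1. The database is currently at version 10.2.0.4, so the latest PSU patch (10.2.0.4.4) can be applied on top of it.
2. Increase the nfile kernel parameter to 131072.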
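
As a rough check on the second recommendation, the sketch below compares the nfile usage observed on hisdb01 (51795) against the current and the proposed limits. The 75% warning threshold is an arbitrary assumption chosen for illustration, not a value from Oracle or HP, and `nfile_status` is a hypothetical helper. A check like this could form the core of the host monitoring script that was missing during the first failure.

```python
def nfile_status(usage, setting, warn_pct=75.0):
    """Classify file-table usage against a warning threshold (assumed value)."""
    pct = 100.0 * usage / setting
    level = "WARN" if pct >= warn_pct else "OK"
    return level, round(pct, 1)


if __name__ == "__main__":
    print(nfile_status(51795, 65536))    # observed usage against the current limit
    print(nfile_status(51795, 131072))   # same usage against the proposed limit
```

Raising the limit only buys headroom. If descriptors are leaking because of the bug described above, usage will keep climbing until the patch is applied.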


2010-06-07 16:27