Linux High-Availability Solutions: Viewing Heartbeat Logs (Original)

czmmiao


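Logs are the best way to track what the system and its applications are doing. Heartbeat lets you choose where its log output goes; this only requires a setting in the ha.cf file. For details, see the author's post:
http://czmmiao.iteye.com/blog/1174010

Let's take a detailed look at the Heartbeat logs.
Starting the Heartbeat service on the primary node

# /etc/init.d/heartbeat start
While Heartbeat starts, run "tail -f /var/log/messages" to watch the system log on the primary node. The output looks like this: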
# tail -f /var/log/messages
    Nov 26 07:52:21 node1 heartbeat: [3688]: info: Configuration validated. Starting heartbeat 2.0.8
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: heartbeat: version 2.0.8
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: Heartbeat generation: 3
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_TriggerHandler: Added signal manual handler
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_TriggerHandler: Added signal manual handler
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: ping heartbeat started.
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_SignalHandler: Added signal handler for signal 17
    Nov 26 07:52:21 node1 heartbeat: [3689]: info: Local status now set to: 'up'
    Nov 26 07:52:22 node1 heartbeat: [3689]: info: Link node1:eth1 up.
    Nov 26 07:52:23 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 up.
    Nov 26 07:52:23 node1 heartbeat: [3689]: info: Status update for node 192.168.60.1: status ping
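This part of the log shows Heartbeat initializing its configuration: the heartbeat interval, the UDP broadcast port, the state of the ping node, and so on. The log then pauses for 120 seconds before Heartbeat writes anything more, and 120 seconds is exactly the value set for the "initdead" option in ha.cf. Heartbeat then prints the following: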
    Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: node node2: is dead
    Nov 26 07:54:22 node1 heartbeat: [3689]: info: Comm_now_up(): updating status to active
    Nov 26 07:54:22 node1 heartbeat: [3689]: info: Local status now set to: 'active'
    Nov 26 07:54:22 node1 heartbeat: [3689]: info: Starting child client "/usr/lib/heartbeat/ipfail" (694,694)
    Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: No STONITH device configured.
    Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: Shared disks are not protected.
    Nov 26 07:54:22 node1 heartbeat: [3689]: info: Resources being acquired from node2.
    Nov 26 07:54:22 node1 heartbeat: [3712]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 694 gid 694 (pid 3712)
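In this part of the log, node2 has not been started yet, so Heartbeat emits the "node2: is dead" warning and then starts the ipfail plugin. Because STONITH is not configured in ha.cf, the log also shows the "No STONITH device configured" warning.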
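When the log is busy, it can help to pull out only the warnings and errors. The following is a minimal sketch, not part of Heartbeat itself: heartbeat_warnings and sample_startup_log are names made up for this example, and the sample lines are taken from the excerpt above, one entry per line as they would appear in /var/log/messages.

```shell
# Print only the WARN and ERROR entries written by the heartbeat daemon.
# Reads syslog-formatted text on standard input.
# (heartbeat_warnings is a hypothetical helper name, not a Heartbeat tool.)
heartbeat_warnings() {
    grep -E 'heartbeat: \[[0-9]+\]: (WARN|ERROR):'
}

# Sample entries from the startup log shown in this article.
sample_startup_log() {
    cat <<'EOF'
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: node node2: is dead
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Local status now set to: 'active'
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: No STONITH device configured.
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: Shared disks are not protected.
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Resources being acquired from node2.
EOF
}

# Show the three warnings and drop the info lines.
sample_startup_log | heartbeat_warnings
```

On a live node the same filter could be fed from the real log, for example "heartbeat_warnings < /var/log/messages".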
The log continues:
    Nov 26 07:54:23 node1 harc[3713]: info: Running /etc/ha.d/rc.d/status status
    Nov 26 07:54:23 node1 mach_down[3735]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
    Nov 26 07:54:23 node1 mach_down[3735]: info: mach_down takeover complete for node node2.
    Nov 26 07:54:23 node1 heartbeat: [3689]: info: mach_down takeover complete.
    Nov 26 07:54:23 node1 heartbeat: [3689]: info: Initial resource acquisition complete (mach_down)
    Nov 26 07:54:24 node1 IPaddr[3768]: INFO: Resource is stopped
    Nov 26 07:54:24 node1 heartbeat: [3714]: info: Local Resource acquisition completed.
    Nov 26 07:54:24 node1 harc[3815]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
    Nov 26 07:54:24 node1 ip-request-resp[3815]: received ip-request-resp 192.168.60.200/24/eth0 OK yes
    Nov 26 07:54:24 node1 ResourceManager[3830]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
    Nov 26 07:54:24 node1 IPaddr[3854]: INFO: Resource is stopped
    Nov 26 07:54:25 node1 ResourceManager[3830]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
    Nov 26 07:54:25 node1 IPaddr[3932]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
    Nov 26 07:54:25 node1 IPaddr[3932]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
    Nov 26 07:54:25 node1 IPaddr[3932]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
    Nov 26 07:54:25 node1 avahi-daemon[1854]: Registering new address record for 192.168.60.200 on eth0.
    Nov 26 07:54:25 node1 IPaddr[3932]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
    Nov 26 07:54:26 node1 IPaddr[3911]: INFO: Success
    Nov 26 07:54:26 node1 Filesystem[4021]: INFO: Resource is stopped
    Nov 26 07:54:26 node1 ResourceManager[3830]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
    Nov 26 07:54:26 node1 Filesystem[4062]: INFO: Running start for /dev/sdb5 on /webdata
    Nov 26 07:54:26 node1 kernel: kjournald starting. Commit interval 5 seconds
    Nov 26 07:54:26 node1 kernel: EXT3 FS on sdb5, internal journal
    Nov 26 07:54:26 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    Nov 26 07:54:26 node1 Filesystem[4059]: INFO: Success
    Nov 26 07:54:33 node1 heartbeat: [3689]: info: Local Resource acquisition completed. (none)
    Nov 26 07:54:33 node1 heartbeat: [3689]: info: local resource transition completed
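This part of the log shows resources being monitored and taken over. It mainly carries out the settings in the haresources file, which here means bringing up the cluster virtual IP and mounting the disk partition.
At this point, checking the primary node's network configuration with the ifconfig command shows that the primary node has automatically bound the cluster IP address. Pinging the cluster IP address 192.168.60.200 from a host outside the HA cluster shows that it is reachable, meaning the address is now in service.
Checking the mounted partitions likewise shows that the shared disk partition /dev/sdb5 has been mounted automatically.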
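The mount check can be scripted. The sketch below assumes a mount table in /proc/mounts format (device, mount point, file system type, options); is_mounted and sample_mounts are hypothetical names, and the /dev/sda1 line in the sample is invented for illustration. Only /dev/sdb5, /webdata and ext3 come from this article.

```shell
# Succeed if DEVICE ($1) is mounted on MOUNTPOINT ($2) according to a
# mount table in /proc/mounts format read from standard input.
# (is_mounted is a hypothetical helper, not part of Heartbeat.)
is_mounted() {
    awk -v dev="$1" -v mnt="$2" '
        $1 == dev && $2 == mnt { found = 1 }
        END { exit !found }'
}

# Illustrative mount table; the /dev/sda1 entry is made up.
sample_mounts() {
    cat <<'EOF'
/dev/sda1 / ext3 rw 0 0
/dev/sdb5 /webdata ext3 rw 0 0
EOF
}

if sample_mounts | is_mounted /dev/sdb5 /webdata; then
    echo "/dev/sdb5 is mounted on /webdata"
else
    echo "/dev/sdb5 is NOT mounted on /webdata"
fi
```

On the node that currently holds the resources, "is_mounted /dev/sdb5 /webdata < /proc/mounts" should succeed, and on the other node it should fail.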
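Starting Heartbeat on the backup node
Heartbeat is started on the backup node the same way as on the primary node, with the following command:
# /etc/init.d/heartbeat start
Or run:
# service heartbeat start
The backup node's Heartbeat log output corresponds to that of the primary node. "tail -f /var/log/messages" shows the following: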
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Link node1:eth1 up.
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Status update for node node1: status active
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Link node1:eth0 up.
    Nov 26 07:57:15 node2 harc[2123]: info: Running /etc/ha.d/rc.d/status status
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Comm_now_up(): updating status to active
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Local status now set to: 'active'
    Nov 26 07:57:15 node2 heartbeat: [2110]: info: Starting child client "/usr/lib/heartbeat/ipfail" (694,694)
    Nov 26 07:57:15 node2 heartbeat: [2110]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 70 ms (> 50 ms) (GSource: 0x8f62080)
    Nov 26 07:57:15 node2 heartbeat: [2134]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 694 gid 694 (pid 2134)
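The backup node detects that node1 is active and that there are no resources to take over, so it only starts the network-monitoring plugin ipfail to watch the primary node's heartbeat.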

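Testing Heartbeat
How can you tell whether the HA cluster is working properly? Testing in a simulated environment is a good approach. Before putting a Heartbeat high-availability cluster into production, run the following five tests to confirm that HA works as expected.
1. Cleanly stopping and restarting Heartbeat on the primary node
First run "service heartbeat stop" on the primary node node1 to shut down its Heartbeat process cleanly, then check the primary node's network interfaces with ifconfig. Normally you should see that the primary node has released the cluster's service IP address and has unmounted the shared disk partition. Then check the backup node: it has now taken over the cluster's service IP and has automatically mounted the shared disk partition.
During this process, test the cluster service IP with ping. The cluster IP remains reachable throughout, with no delay or blocking. In other words, when the primary node is shut down cleanly, the switchover between primary and backup is seamless, and the service provided by the HA cluster keeps running without interruption.
Next, start Heartbeat on the primary node normally. Once Heartbeat is up, whether the backup node automatically releases the resources depends on the auto_failback setting. This article sets it to on, so the backup node releases the resources automatically and the primary node takes over the cluster resources again. The backup node releasing resources and the primary node binding them happen in step, so this too is a seamless switchover.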
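Before running the failback test it is worth confirming what ha.cf actually says. Here is a small sketch for reading one directive from ha.cf-style text. ha_option and sample_ha_cf are hypothetical names, and the sample fragment contains only the two values this article mentions (initdead 120 and auto_failback on); a real ha.cf has many more lines. If a directive is repeated, this sketch simply prints the last occurrence, which is a choice made here and not a statement about how Heartbeat resolves duplicates.

```shell
# Print the value of directive NAME ($1) from ha.cf-style text on
# standard input. Comment lines are skipped. Fails if NAME is absent.
# (ha_option is a hypothetical helper, not part of Heartbeat.)
ha_option() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }
        $1 == name { value = $2; found = 1 }
        END { if (found) print value; else exit 1 }'
}

# Illustrative fragment holding only the values mentioned in this article.
sample_ha_cf() {
    cat <<'EOF'
# seconds to wait for the other node at startup
initdead 120
auto_failback on
EOF
}

echo "initdead:      $(sample_ha_cf | ha_option initdead)"
echo "auto_failback: $(sample_ha_cf | ha_option auto_failback)"
```

Against a real installation this would typically be run as "ha_option auto_failback < /etc/ha.d/ha.cf".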
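2. Unplugging the network cable on the primary node
After the cable connecting the primary node to the public network is unplugged, the Heartbeat plugin ipfail detects the loss of connectivity right away through its ping test and then releases the resources automatically. At the same time, the ipfail plugin on the standby node detects the network failure on the primary node. After waiting for the primary node to finish releasing the resources, the standby node immediately takes over the cluster resources, which keeps the network service running without interruption.
Likewise, when the primary node's network recovers, the "auto_failback on" setting causes the cluster resources to switch back from the standby node to the primary node automatically.
After the cable is unplugged on the primary node, its log shows the following: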
    Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link node2:eth0 dead.
    Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 dead.
    Nov 26 09:04:09 node1 ipfail: [3712]: info: Status update: Node 192.168.60.1 now has status dead
    Nov 26 09:04:09 node1 harc[4279]: info: Running /etc/ha.d/rc.d/status status
    Nov 26 09:04:10 node1 ipfail: [3712]: info: NS: We are dead. :<
    Nov 26 09:04:10 node1 ipfail: [3712]: info: Link Status update: Link node2/eth0 now has status dead
    ...... (middle part omitted) ......
    Nov 26 09:04:20 node1 heartbeat: [3689]: info: node1 wants to go standby [all]
    Nov 26 09:04:20 node1 heartbeat: [3689]: info: standby: node2 can take our all resources
    Nov 26 09:04:20 node1 heartbeat: [4295]: info: give up all HA resources (standby).
    Nov 26 09:04:21 node1 ResourceManager[4305]: info: Releasing resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
    Nov 26 09:04:21 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 stop
    Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Running stop for /dev/sdb5 on /webdata
    Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Trying to unmount /webdata
    Nov 26 09:04:21 node1 Filesystem[4343]: INFO: unmounted /webdata successfully
    Nov 26 09:04:21 node1 Filesystem[4340]: INFO: Success
    Nov 26 09:04:22 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 stop
    Nov 26 09:04:22 node1 IPaddr[4428]: INFO: /sbin/ifconfig eth0:0 192.168.60.200 down
    Nov 26 09:04:22 node1 avahi-daemon[1854]: Withdrawing address record for 192.168.60.200 on eth0.
    Nov 26 09:04:22 node1 IPaddr[4407]: INFO: Success
The standby node's log while it takes over the primary node's resources is as follows:
    Nov 26 09:02:58 node2 heartbeat: [2110]: info: Link node1:eth0 dead.
    Nov 26 09:02:58 node2 ipfail: [2134]: info: Link Status update: Link node1/eth0 now has status dead
    Nov 26 09:02:59 node2 ipfail: [2134]: info: Asking other side for ping node count.
    Nov 26 09:02:59 node2 ipfail: [2134]: info: Checking remote count of ping nodes.
    Nov 26 09:03:02 node2 ipfail: [2134]: info: Telling other node that we have more visible ping nodes.
    Nov 26 09:03:09 node2 heartbeat: [2110]: info: node1 wants to go standby [all]
    Nov 26 09:03:10 node2 heartbeat: [2110]: info: standby: acquire [all] resources from node1
    Nov 26 09:03:10 node2 heartbeat: [2281]: info: acquire all HA resources (standby).
    Nov 26 09:03:10 node2 ResourceManager[2291]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
    Nov 26 09:03:10 node2 IPaddr[2315]: INFO: Resource is stopped
    Nov 26 09:03:11 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
    Nov 26 09:03:11 node2 IPaddr[2393]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
    Nov 26 09:03:11 node2 IPaddr[2393]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
    Nov 26 09:03:11 node2 IPaddr[2393]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
    Nov 26 09:03:12 node2 avahi-daemon[1844]: Registering new address record for 192.168.60.200 on eth0.
    Nov 26 09:03:12 node2 IPaddr[2393]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
    Nov 26 09:03:12 node2 IPaddr[2372]: INFO: Success
    Nov 26 09:03:12 node2 Filesystem[2482]: INFO: Resource is stopped
    Nov 26 09:03:12 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
    Nov 26 09:03:13 node2 Filesystem[2523]: INFO: Running start for /dev/sdb5 on /webdata
    Nov 26 09:03:13 node2 kernel: kjournald starting. Commit interval 5 seconds
    Nov 26 09:03:13 node2 kernel: EXT3 FS on sdb5, internal journal
    Nov 26 09:03:13 node2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    Nov 26 09:03:13 node2 Filesystem[2520]: INFO: Success
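The timestamps in the standby node's log give a rough measure of how long the takeover took. The sketch below computes the interval between the first "Link ... dead" entry and the last "INFO: Success" entry. takeover_seconds and sample_takeover_log are hypothetical names, and the choice of these two entries as start and end markers is an assumption made for this example. It also assumes both events fall on the same day. Because the two nodes' clocks are not in step in these excerpts (node1 logs the event at 09:04, node2 at 09:02), the measurement only makes sense within a single node's log.

```shell
# Print the seconds elapsed between the first "Link ... dead." entry and
# the last "INFO: Success" entry in syslog text on standard input.
# Field 3 of each syslog line is the HH:MM:SS timestamp.
# (takeover_seconds is a hypothetical helper, not part of Heartbeat.)
takeover_seconds() {
    awk '
        / Link .* dead\./ && first == "" { first = $3 }
        /INFO: Success/                  { last = $3 }
        END {
            if (first == "" || last == "") exit 1
            split(first, s, ":")
            split(last, e, ":")
            t0 = s[1] * 3600 + s[2] * 60 + s[3]
            t1 = e[1] * 3600 + e[2] * 60 + e[3]
            print t1 - t0
        }'
}

# A few entries from the standby node's log shown in this article.
sample_takeover_log() {
    cat <<'EOF'
Nov 26 09:02:58 node2 heartbeat: [2110]: info: Link node1:eth0 dead.
Nov 26 09:03:09 node2 heartbeat: [2110]: info: node1 wants to go standby [all]
Nov 26 09:03:12 node2 IPaddr[2372]: INFO: Success
Nov 26 09:03:13 node2 Filesystem[2520]: INFO: Success
EOF
}

echo "takeover took $(sample_takeover_log | takeover_seconds) seconds"
```

For the sample entries this reports 15 seconds, from 09:02:58 to 09:03:13.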
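3. Unplugging the power cable on the primary node
After power is cut on the primary node, the Heartbeat process on the standby node immediately receives a message that the primary node has shut down. If a STONITH device is configured in the cluster, the standby node uses it to power off or reset the primary node. Only once the STONITH device has completed all of its operations does the backup node gain ownership of the primary node's resources and take them over.
After power is cut on the primary node, the backup node logs output similar to the following: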
    Nov 26 09:24:54 node2 heartbeat: [2110]: info: Received shutdown notice from 'node1'.
    Nov 26 09:24:54 node2 heartbeat: [2110]: info: Resources being acquired from node1.
    Nov 26 09:24:54 node2 heartbeat: [2712]: info: acquire local HA resources (standby).
    Nov 26 09:24:55 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
    Nov 26 09:24:57 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
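4. Cutting all network connections on the primary node
After the heartbeat link is disconnected on the primary node, both nodes log an "eth1 dead" message, but no resource switchover takes place between them. This is a split-brain condition: each node believes the other is down. If the cable connecting the primary node to the public network is then also unplugged, the network resources move to the standby node because of the connectivity loss, but the storage resource does not move over cleanly. The result is a serious problem: the standby node serves clients while the primary node still has the storage mounted, which can easily lead to inconsistent data. Reconnect the primary node's heartbeat link and watch the system log: the Heartbeat process on the standby node restarts and then takes control of the cluster resources again. Finally, reconnect the primary node's public network cable, and the cluster resources move from the standby node back to the primary node. That is the whole switchover sequence.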
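To see at a glance which links Heartbeat has declared dead, the "Link ... dead." entries can be extracted from the log. This is a sketch under stated assumptions: dead_links and sample_link_log are hypothetical names, and the sample entries are the eth0 lines from the cable-pull test earlier in this article, since no log excerpt is given for the heartbeat-link test itself.

```shell
# List every link that the heartbeat daemon has reported as dead, one
# "node:interface" pair per line, from syslog text on standard input.
# (dead_links is a hypothetical helper, not part of Heartbeat.)
dead_links() {
    sed -n 's/.*heartbeat: \[[0-9]*]: info: Link \(.*\) dead\.$/\1/p'
}

# Entries from the primary node's log in the cable-pull test.
sample_link_log() {
    cat <<'EOF'
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link node2:eth0 dead.
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 dead.
Nov 26 09:04:09 node1 ipfail: [3712]: info: Status update: Node 192.168.60.1 now has status dead
EOF
}

# Prints the two dead links and ignores the ipfail status line.
sample_link_log | dead_links
```

In the heartbeat-link test, piping the real log through "dead_links | grep ':eth1$'" would single out the "eth1 dead" entries described above.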
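5. Abnormally terminating the Heartbeat daemon on the primary node
The Heartbeat process on the primary node can be killed with the "killall -9 heartbeat" command. Because the process is killed rather than shut down cleanly, the resources Heartbeat controls are not released. After a short period without any response from the primary node, the backup node concludes that the primary node has failed and takes over its resources. This creates resource contention: both nodes hold the same resource, which causes data conflicts. This can be addressed with watchdog, the kernel monitoring module that Linux provides, by integrating it into Heartbeat. If Heartbeat terminates abnormally or the system fails, watchdog reboots the system automatically, which releases the cluster resources and avoids the data conflict.
This article does not configure watchdog in the cluster. If watchdog were configured, running "killall -9 heartbeat" would produce the following message in /var/log/messages: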
    Softdog: WDT device closed unexpectedly. WDT will not stop!
This message tells us that the system has run into a problem and is about to reboot.

Bugs in the logs

With Heartbeat 2.0.7, enabling crm produces the following errors:

    ccm[22165]: 2011/08/30_15:18:29 ERROR: REASON: can't send message to IPC: Success
    cib[22166]: 2011/08/30_15:18:29 WARN: validate_cib_digest:io.c No on-disk digest present
    ccm[22165]: 2011/08/30_15:18:29 ERROR: Initialization failed. Exit
    cib[22166]: 2011/08/30_15:18:29 info: readCibXmlFile: [on-disk] <cib admin_epoch="0" epoch="0" num_updates="0" have_quorum="false">
    heartbeat[22155]: 2011/08/30_15:18:29 WARN: Exiting /usr/lib64/heartbeat/ccm process 22165 returned rc 1.
    cib[22166]: 2011/08/30_15:18:29 info: readCibXmlFile: [on-disk] <configuration>
    heartbeat[22155]: 2011/08/30_15:18:29 ERROR: Respawning client "/usr/lib64/heartbeat/ccm":
    heartbeat[22155]: 2011/08/30_15:18:28 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 150 ms (> 10 ms) (GSource: 0x156837f8)
    heartbeat[22155]: 2011/08/30_15:18:28 WARN: duplicate client add request [ccm] [22165]
    heartbeat[22155]: 2011/08/30_15:18:28 ERROR: api_process_registration_msg: cannot add client()
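An English-language site explains it as follows. The author's English is limited and an accurate translation is not possible, so the explanation is quoted as-is for readers to interpret: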
This is a bug in the heartbeat API library. I'm pretty sure it has an implicit assumption that no child will connect via the heartbeat API if its parent connected before the fork, and there was no intervening exec() call...
Source: http://lists.linux-ha.org/pipermail/linux-ha-dev/2005-September/011785.html

References: http://book.51cto.com/art/200912/168038.htm

http://lists.linux-ha.org/pipermail/linux-ha-dev/2005-September/011785.html

This is an original article. If you repost it, please credit the source and the author.

Corrections are welcome.

Email: czmcj@163.com

2011-09-17 22:22