Problem (Abstract)
After setting up the HTTP plug-in for load balancing in a clustered IBM® WebSphere® environment, the HTTP plug-in is not performing failover in a timely manner or at all when a cluster member becomes unavailable.
Cause
In most cases, this behavior results from a misunderstanding of how HTTP plug-in failover works or from an improper configuration. The type of Web server in use (multi-threaded versus single-threaded) can also affect this behavior.
Resolving the problem
This document explains how HTTP plug-in failover works and describes tuning parameters and suggestions that help the HTTP plug-in fail over effectively and in a timely manner.
Note: The following information is written specifically for IBM HTTP Server; however, it generally applies to other Web servers that currently support the HTTP plug-in (for example, IIS, SunOne, and Domino®).
Failover
Background
In clustered IBM WebSphere Application Server environments, the HTTP plug-in can provide failover when it is no longer able to send requests to a particular cluster member. By default, the HTTP plug-in marks a cluster member down and fails over client requests to another cluster member that can still receive connections under the following conditions:
The HTTP plug-in is unable to establish a connection to a cluster member's Application Server transport.
The HTTP plug-in detects a newly connected socket that was prematurely closed by a cluster member during an active read or write.
Several settings in the plugin-cfg.xml file can be tuned to affect how quickly the HTTP plug-in marks a cluster member down and fails over to another cluster member.
ConnectTimeout
The ConnectTimeout attribute of a Server element enables the HTTP plug-in to perform non-blocking connections with a backend cluster member. Non-blocking connections are beneficial when the HTTP plug-in is unable to contact the destination to determine if the port is available or unavailable for a particular cluster member.
<Server CloneID="10k66djk2" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
    <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
</Server>
If no ConnectTimeout value is specified, the HTTP plug-in performs a blocking connect: it waits until an operating system TCP timeout occurs (as long as 2 minutes, depending on the platform) before it can mark the cluster member unavailable. A value of 0 also causes the HTTP plug-in to perform a blocking connect. A value greater than 0 specifies the number of seconds the HTTP plug-in waits for a successful connection. If a connection is not established within that interval, the HTTP plug-in marks the cluster member unavailable and fails over to one of the other cluster members defined in the cluster.
Caution: In an environment with a busy workload or a slow network connection, setting this value too low could cause the HTTP plug-in to mark a cluster member down falsely. Choose the ConnectTimeout value with care.
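The difference between a blocking connect and a connect bounded by a timeout can be illustrated outside the plug-in. The following Python sketch is not the plug-in's implementation; the function name try_connect and its parameter are hypothetical and only mirror the ConnectTimeout semantics described above (0 means block until the operating system gives up, a positive value is a limit in seconds).

```python
import socket


def try_connect(host: str, port: int, connect_timeout: float = 0) -> bool:
    """Return True if a TCP connection succeeds, False if it fails or times out.

    Hypothetical helper mirroring the ConnectTimeout semantics:
    0 means a blocking connect (bounded only by the OS-level TCP timeout),
    a positive value is the number of seconds to wait.
    """
    timeout = None if connect_timeout == 0 else connect_timeout
    try:
        # timeout=None puts the socket in blocking mode for the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and socket timeouts.
        return False
```

A caller that receives False would treat the destination as unavailable and try another one, which corresponds to the mark-down and failover step described above.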
ServerIOTimeout
The ServerIOTimeout attribute of a Server element enables the HTTP plug-in to set a timeout value, in seconds, for sending requests to and reading responses from a cluster member. If a value is not set for the ServerIOTimeout attribute, the HTTP plug-in, by default, uses blocked I/O to write requests to and read responses from the cluster member until the TCP connection times out. For example, if you specify:
<Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
    <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
</Server>
In this case, if a cluster member stops responding to requests, the HTTP plug-in waits 120 seconds (2 minutes) before timing out the TCP connection. Setting the ServerIOTimeout attribute to a reasonable value enables the HTTP plug-in to time out the connection sooner, and transfer requests to another cluster member when possible.
When selecting a value for this attribute, remember that sometimes it might take a couple of minutes for a cluster member to process a request. Setting the value of the ServerIOTimeout attribute too low could cause the HTTP plug-in to send a false server error response to the client.
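The effect of an I/O timeout on a silent peer can be sketched in a few lines of Python. This is an illustration of the general technique (a per-socket read timeout), not the plug-in's code; read_response and server_io_timeout are hypothetical names, and the sketch assumes a positive timeout value.

```python
import socket
from typing import Optional


def read_response(sock: socket.socket, server_io_timeout: float) -> Optional[bytes]:
    """Read up to 4096 bytes, or return None if the peer stays silent too long.

    Hypothetical helper: server_io_timeout is assumed to be a positive
    number of seconds, analogous to the ServerIOTimeout attribute.
    """
    sock.settimeout(server_io_timeout)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # No data arrived within the limit; the caller can give up on this
        # server instead of waiting for the TCP-level timeout.
        return None
```

Without the settimeout call, recv would block until the operating system declares the connection dead, which is the slow-detection case this attribute is meant to avoid.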
The ServerIOTimeout is ideal for situations where Keep-Alive connections exist between the WebSphere Application Server and HTTP plug-in, and the Application Server machine is abruptly disconnected from the network.
For example, without ServerIOTimeout, the HTTP plug-in would take a long time to detect that the connection was closed abruptly on the WebSphere Application Server machine. This is illustrated as follows:
When an application host machine is shut down abruptly, the Keep-Alive connections between the HTTP plug-in and the Application Server might not be closed completely. As a result, when the HTTP plug-in needs to route a request to the host machine, it uses an existing Keep-Alive connection if there is one in the pool. Because the host machine was taken down abruptly, the HTTP plug-in machine does not receive any TCP packets to close the connection when the plug-in sends the request over it, so the write does not return a failure until the connection times out at the TCP level. The HTTP plug-in then tries to contact the same application server by establishing a new connection, and the connect() call fails only after the TCP timeout. Depending on the operating system TCP timeout setting, it can therefore take a considerable amount of time for the HTTP plug-in to detect the application server status and mark it down before failing over to another application server. If many requests are sent to the server during this time, the delay applies to every request.
Note: To avoid the preceding behavior, the ServerIOTimeout attribute was introduced with APAR PQ96015 and included in WebSphere Application Server V5.0.2.10 and V5.1.1.4.
Caution: When both ConnectTimeout and ServerIOTimeout are specified, it could take as long as (ConnectTimeout + ServerIOTimeout) for the HTTP plug-in to detect and mark a server down.
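As a quick check of the upper bound stated in the caution, the arithmetic can be written out. The helper name below is hypothetical, and the sketch assumes both attributes are set to positive values; with a value of 0 or an unset attribute the bound depends on the operating system TCP timeout instead.

```python
def worst_case_detection_seconds(connect_timeout: int, server_io_timeout: int) -> int:
    """Upper bound, in seconds, before a server is detected and marked down.

    Hypothetical helper; assumes both timeouts are positive. Otherwise the
    bound depends on the OS TCP timeout and cannot be computed here.
    """
    if connect_timeout <= 0 or server_io_timeout <= 0:
        raise ValueError("both timeouts must be positive for this bound to apply")
    return connect_timeout + server_io_timeout


# Values from the examples in this document: ConnectTimeout=10, ServerIOTimeout=120.
print(worst_case_detection_seconds(10, 120))  # 130
```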
RetryInterval
RetryInterval is an integer specifying the length of time, in seconds, that must elapse from the time a server is marked down until the HTTP plug-in retries a connection. The default is 60 seconds.
This setting is specified in the ServerCluster element. An example of this in the plugin-cfg.xml file is as follows:
<ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin" Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">
This would mean that if a cluster member were marked as down, the HTTP plug-in would not retry it for 120 seconds.
There is no way to recommend one specific value; the value chosen depends on your environment. For example, if you have numerous cluster members, and one cluster member being unavailable does not affect the performance of your application, then you can safely set the value to a very high number.
Alternatively, if your optimum load has been calculated assuming that all cluster members are available, or if you have only a few cluster members, then you will want cluster members to be retried more often to maintain the load.
Also, take into consideration the time it takes to restart your server. If a server takes a long time to boot up and load applications, then you will need a longer retry interval.
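The retry rule described in this section (a server marked down is not tried again until the retry interval has elapsed) can be modeled with a small state object. This is a simplified sketch under stated assumptions, not the plug-in's internal logic; the class ServerState, its methods, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Optional


class ServerState:
    """Tracks whether a cluster member is marked down and when it may be retried."""

    def __init__(self, name: str, retry_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self.retry_interval = retry_interval  # seconds; 60 is the documented default
        self._clock = clock                   # injectable so the logic can be tested
        self._marked_down_at: Optional[float] = None

    def mark_down(self) -> None:
        """Record the moment the server was found unavailable."""
        self._marked_down_at = self._clock()

    def mark_up(self) -> None:
        """Clear the down state after a successful retry."""
        self._marked_down_at = None

    def is_eligible(self) -> bool:
        """True if the server is up, or if the retry interval has elapsed."""
        if self._marked_down_at is None:
            return True
        return self._clock() - self._marked_down_at >= self.retry_interval
```

With retry_interval=120, a server marked down at time 0 is skipped at 119 seconds and becomes eligible again at 120 seconds, which matches the RetryInterval="120" example above.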
PrimaryServers versus BackupServers
The HTTP plug-in can be configured for true failover by using the PrimaryServers and BackupServers elements in the plugin-cfg.xml configuration file.
In the following example, the plug-in load balances only between the two servers defined in the PrimaryServers element, Server1_WebSphere_Appserver and Server2_WebSphere_Appserver. However, if both Server1_WebSphere_Appserver and Server2_WebSphere_Appserver become unavailable and are marked down, the HTTP plug-in fails over and starts sending requests to Server3_WebSphere_Appserver, which is defined in the BackupServers element.
<ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin" Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">
    <Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
        <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
    </Server>
    <Server CloneID="10k67eta9" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="999" MaxConnections="0" Name="Server2_WebSphere_Appserver" WaitForContinue="false">
        <Transport Hostname="server2.domain.com" Port="9091" Protocol="http"/>
    </Server>
    <Server CloneID="10k68xtw10" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="998" MaxConnections="0" Name="Server3_WebSphere_Appserver" WaitForContinue="false">
        <Transport Hostname="server3.domain.com" Port="9091" Protocol="http"/>
    </Server>
    <PrimaryServers>
        <Server Name="Server1_WebSphere_Appserver"/>
        <Server Name="Server2_WebSphere_Appserver"/>
    </PrimaryServers>
    <BackupServers>
        <Server Name="Server3_WebSphere_Appserver"/>
    </BackupServers>
</ServerCluster>
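The primary-versus-backup rule in the example above can be expressed as a short selection function. This sketch only models the rule as described in this document (backups are used only when every primary is marked down); the function name and parameters are hypothetical and it does not model load-balancing weights.

```python
from typing import Iterable, List, Set


def candidate_servers(primary: Iterable[str], backup: Iterable[str],
                      marked_down: Set[str]) -> List[str]:
    """Return the servers that may receive requests.

    Primaries that are not marked down are always preferred; backup
    servers are considered only when no primary is available.
    """
    available_primary = [s for s in primary if s not in marked_down]
    if available_primary:
        return available_primary
    return [s for s in backup if s not in marked_down]


primary = ["Server1_WebSphere_Appserver", "Server2_WebSphere_Appserver"]
backup = ["Server3_WebSphere_Appserver"]

# One primary down: traffic stays on the remaining primary.
print(candidate_servers(primary, backup, {"Server1_WebSphere_Appserver"}))
# Both primaries down: traffic moves to the backup server.
print(candidate_servers(primary, backup, set(primary)))
```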