env:
hbase, 0.94.26
zookeeper, 3.4.3
---------------
1. Downed node
This morning our monitoring showed that a regionserver (host-34) was down, so we dug into the hbase logs on that host and found:
2016-02-29 00:50:36,799 INFO [regionserver60020-SendThread(host-04:2181)] ClientCnxn.java:1083 Client session timed out, have not heard from server in 60030ms for sessionid 0x4511b9
This means the read timeout was hit during this period.
The read timeout is derived as follows:
readtimeout = negotiatedtimeout(client[180000 by default in hbase], server[60000,90000]) * 2/3
The client asks for more than the server maximum, so negotiatedtimeout = 90 secs and the read timeout is 60 secs.
The connect timeout is:
conn timeout = negotiatedtimeout / host provider
where host provider is the number of zk nodes the client can connect to. We have five nodes in the quorum, so it is 90/5 = 18 secs.
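The two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the ZooKeeper client code: the function names are made up for this example, and the requested value of 180000 ms is an assumption (any request above the server maximum of 90000 ms gives the same result).

```python
from typing import Tuple


def negotiate_session_timeout(requested_ms: int, min_ms: int, max_ms: int) -> int:
    """Clamp the client's requested timeout into the server's [min, max] range."""
    return max(min_ms, min(requested_ms, max_ms))


def client_timeouts(negotiated_ms: int, host_count: int) -> Tuple[int, int]:
    """Return (read timeout, connect timeout) in ms, using the formulas above."""
    read_timeout_ms = negotiated_ms * 2 // 3
    connect_timeout_ms = negotiated_ms // host_count
    return read_timeout_ms, connect_timeout_ms


# Assumed request of 180000 ms against a server range of [60000, 90000] ms.
negotiated = negotiate_session_timeout(180000, 60000, 90000)
read_ms, connect_ms = client_timeouts(negotiated, host_count=5)
print(negotiated, read_ms, connect_ms)  # 90000 60000 18000
```

With five hosts this reproduces the 60 secs read timeout and the 18 secs connect timeout used in the rest of this post.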
After a few retries (note: this connection started at 00:49:36):
host-34,hbase:
2016-02-29 00:50:36,799 INFO [regionserver60020-SendThread(host-04:2181)] ClientCnxn.java:1083 Client session timed out, have not heard from server in 60030ms for sessionid 0x4511b9
duration==
2016-02-29 00:50:44,540 INFO [regionserver60020-SendThread(host-07:2181)] ClientCnxn.java:966 Opening socket connection to server host-07/192.168.100.117:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
…
2016-02-29 00:51:02,559 INFO [regionserver60020-SendThread(host-07:2181)] ClientCnxn.java:1083 Client session timed out, have not heard from server in **18790ms** for sessionid 0x4511b93876c000b, closing socket connection and attempting reconnect
(The 18 secs between these two entries matches the connect timeout computed above.)
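To check which timeout each "have not heard from server" entry corresponds to, the idle times can be pulled out of the client log with a small script. This is a rough sketch: the helper names are hypothetical, the regex only targets the message text shown above (tolerating the `**` emphasis used in this post), and the 60000/18000 ms defaults are the values derived earlier.

```python
import re

# Idle time reported by the zk client; optional '*' covers the emphasis added in this post.
IDLE_RE = re.compile(r"have not heard from server in \**(\d+)ms")


def idle_times_ms(log_lines):
    """Return every 'have not heard from server' duration (ms) found in the lines."""
    return [int(m.group(1)) for line in log_lines for m in IDLE_RE.finditer(line)]


def classify(idle_ms, read_timeout_ms=60000, connect_timeout_ms=18000):
    """Label an idle time by the largest timeout it has reached."""
    if idle_ms >= read_timeout_ms:
        return "read"
    if idle_ms >= connect_timeout_ms:
        return "connect"
    return "none"


sample = [
    "2016-02-29 00:50:36,799 INFO [regionserver60020-SendThread(host-04:2181)] "
    "ClientCnxn.java:1083 Client session timed out, have not heard from server "
    "in 60030ms for sessionid 0x4511b9",
    "2016-02-29 00:51:02,559 INFO [regionserver60020-SendThread(host-07:2181)] "
    "ClientCnxn.java:1083 Client session timed out, have not heard from server "
    "in **18790ms** for sessionid 0x4511b93876c000b, closing socket connection "
    "and attempting reconnect",
]

for idle in idle_times_ms(sample):
    print(idle, classify(idle))
```

For the two entries above, 60030 ms falls under the read timeout and 18790 ms under the connect timeout.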
The log also shows a network partition:
2016-02-29 00:50:40,499 WARN [regionserver60020-SendThread(host-05:2181)] ClientCnxn.java:1089 Session 0x4511b93876c000b for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
We then looked at the log of the zk node that received the last request from the hbase SendThread:
host-04,zookeeper:
2016-02-29 00:51:25,764 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection from /192.168.100.147:58100
2016-02-29 00:51:25,765 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@831] - Client attempting to renew session 0x4511b93876c000b at /192.168.100.147:58100
2016-02-29 00:51:25,765 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x4511b93876c000b
2016-02-29 00:51:25,766 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@597] - Invalid session 0x4511b93876c000b for client /192.168.100.147:58100, probably expired
2016-02-29 00:51:25,766 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.100.147:58100 which had sessionid 0x4511b93876c000b
Up to this point the elapsed time is 51:25 - 49:36 ~ 110 secs >> 90 secs (the negotiated timeout above), so the session had already expired by the time the client tried to renew it.
The session should have timed out at:
session touch time + negotiatedtimeout = expired timestamp = 49:36 + 90 secs ~ 51:06
Sure enough, one of the zookeeper logs has a matching line right around that timestamp:
host-05,zookeeper: 2016-02-29 00:51:08,000 [myid:3] - INFO [SessionTracker:ZooKeeperServer@334] - Expiring session 0x4511b93876c000b, timeout of 90000ms exceeded
2. Another node that reconnected to zk successfully within the timeout
host-33,hbase: (so this regionserver kept running)
2016-02-29 00:50:55,864 INFO [regionserver60020-SendThread(host-03:2181)] ClientCnxn.java:1207 Session establishment complete on server host-03/192.168.100.113:2181, sessionid = 0x4 0x4511b93876c000b
This was logged at 50:55 < 51:06, before the expiry deadline, so this regionserver kept running happily.
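The timeline for both regionservers can be checked with a short script. It is a sketch of the arithmetic only: it assumes both sessions were last touched at 00:49:36 (this post only states that for host-34) and treats the 90 secs negotiated timeout as the expiry budget. The variable and function names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
NEGOTIATED_TIMEOUT = timedelta(seconds=90)


def expiry_deadline(last_touch: datetime, timeout: timedelta = NEGOTIATED_TIMEOUT) -> datetime:
    """Session touch time + negotiated timeout = expected expiry timestamp."""
    return last_touch + timeout


last_touch = datetime.strptime("2016-02-29 00:49:36", FMT)
deadline = expiry_deadline(last_touch)

# Timestamps taken from the logs quoted above (milliseconds dropped).
host34_renew = datetime.strptime("2016-02-29 00:51:25", FMT)
host33_reconnect = datetime.strptime("2016-02-29 00:50:55", FMT)

print(deadline.strftime("%H:%M:%S"))                # 00:51:06
print((host34_renew - last_touch).total_seconds())  # 109.0
print(host34_renew <= deadline)                     # False: renewal came too late
print(host33_reconnect <= deadline)                 # True: reconnected in time
```

The zk server logged the actual expiry at 00:51:08, a couple of seconds after the computed 00:51:06.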
3. Conclusion
a. The connect timeout for each reconnect attempt is negotiatedtimeout / host provider, i.e. one Nth of the negotiated timeout for a quorum of N nodes, and it applies to every node the client tries.
b. Because of a, the read timeout plus the accumulated connect timeouts can be longer than the negotiated timeout:
read timeout = negotiatedtimeout * 2/3
read timeout + (negotiatedtimeout / host provider) * Nr > negotiatedtimeout
Nr = number of retries
In the extreme case where every node in the zk quorum is unreachable, the client keeps trying for close to double the negotiated timeout (2/3 + 1 = 5/3 of it when each node is tried once).
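The inequality in b can be made concrete with the numbers from this incident. A minimal sketch, assuming each retry waits for the full connect timeout; `worst_case_wait_ms` is a hypothetical helper, not part of any zk API.

```python
def worst_case_wait_ms(negotiated_ms: int, host_count: int, retries: int) -> int:
    """Read timeout plus one full connect timeout per retry, in ms."""
    read_timeout_ms = negotiated_ms * 2 // 3
    connect_timeout_ms = negotiated_ms // host_count
    return read_timeout_ms + connect_timeout_ms * retries


negotiated = 90000
hosts = 5
for retries in range(hosts + 1):
    total = worst_case_wait_ms(negotiated, hosts, retries)
    print(retries, total, total > negotiated)
```

With five hosts the total passes the 90000 ms negotiated timeout on the second retry (96000 ms) and reaches 150000 ms once all five nodes have been tried.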
c. Given b, why is the mechanism designed this way?
Suppose zk used the following scheme instead, so that the read timeout plus all the connect timeouts add up to exactly the negotiated timeout:
connecttimeout = (negotiatedtimeout - readtimeout) / host provider
i.e.
----------negotiatedtimeout------------------||
------read--timeout-------| retry1|retry2....||
In a case like the one here, you would lose the chance to find out what went wrong with hbase/zk. All you would see is:
retry connecting but timed out, retry connecting but timed out, ... and then the regionserver is forcibly shut down, and that is it.
It also means that if something abnormal happens (e.g. a long GC pause that exceeds the whole negotiated timeout), there is no point in retrying against the other nodes in the quorum.
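To compare the two schemes numerically, here is a small sketch using the values from this incident. Both function names are hypothetical, and the "alternative" is only the what-if formula discussed in c, not something zk implements.

```python
def current_connect_timeout_ms(negotiated_ms: int, host_count: int) -> int:
    """Scheme described in this post: negotiatedtimeout / host provider."""
    return negotiated_ms // host_count


def alternative_connect_timeout_ms(negotiated_ms: int, host_count: int) -> int:
    """What-if scheme: split only what is left after the read timeout."""
    read_timeout_ms = negotiated_ms * 2 // 3
    return (negotiated_ms - read_timeout_ms) // host_count


negotiated, hosts = 90000, 5
read_ms = negotiated * 2 // 3

current = current_connect_timeout_ms(negotiated, hosts)
alternative = alternative_connect_timeout_ms(negotiated, hosts)

print(current, read_ms + current * hosts)          # 18000 150000
print(alternative, read_ms + alternative * hosts)  # 6000 90000
```

Under the what-if scheme each node would get only 6 secs per attempt, and the whole retry sequence would end exactly when the session expires.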