I set up a Hadoop HA cluster, manually killed the NameNode on node 1 with kill -9, and found that node 2 did not automatically switch to active. Its log showed the following error:
2018-10-31 14:11:02,098 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2018-10-31 14:11:02,098 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2018-10-31 14:11:02,099 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/root/.ssh/id_rsa (No such file or directory)
at com.jcraft.jsch.KeyPair.load(KeyPair.java:543)
at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:40)
at com.jcraft.jsch.JSch.addIdentity(JSch.java:407)
at com.jcraft.jsch.JSch.addIdentity(JSch.java:367)
at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.FileNotFoundException: /home/root/.ssh/id_rsa (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at com.jcraft.jsch.Util.fromFile(Util.java:508)
at com.jcraft.jsch.KeyPair.load(KeyPair.java:540)
... 15 more
2018-10-31 14:11:02,099 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2018-10-31 14:11:02,099 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2018-10-31 14:11:02,099 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoop1/192.168.150.151:9000
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2018-10-31 14:11:02,099 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2018-10-31 14:11:02,102 INFO org.apache.zookeeper.ZooKeeper: Session: 0x166c8a424df00fb closed
2018-10-31 14:11:03,102 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@14671dfe
2018-10-31 14:11:03,103 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop2/192.168.150.152:2181. Will not attempt to authenticate using SASL (unknown error)
2018-10-31 14:11:03,104 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop2/192.168.150.152:2181, initiating session
2018-10-31 14:11:03,106 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop2/192.168.150.152:2181, sessionid = 0x266c8a421c9010f, negotiated timeout = 5000
2018-10-31 14:11:03,106 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2018-10-31 14:11:03,107 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2018-10-31 14:11:03,107 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2018-10-31 14:11:03,108 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a086d796861646f6f7012036e6e311a076861646f6f703120a84628d33e
2018-10-31 14:11:03,108 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hadoop1/192.168.150.151:9000
The key part is this line:
com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/root/.ssh/id_rsa (No such file or directory)
SSH itself was fine: all of the machines could log into each other without a password. It turned out that the sshfence setting in hdfs-site.xml is used to log into the previously active NameNode over SSH and finish it off (kill its process), so that there is guaranteed to be only one active NameNode. dfs.ha.fencing.ssh.private-key-files points to the private key file on the local machine that is used for that SSH login. Because my private-key path was wrong, fencing could not deliver that final blow, so the standby NameNode could not be sure it was the only one alive and refused to switch to active.
My private key is actually at /root/.ssh/id_rsa, but I had configured the path as /home/root/.ssh/id_rsa.
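For reference, a minimal sketch of the relevant fencing section of hdfs-site.xml after the fix; only the private-key path is the actual change in my setup, the rest is the standard sshfence configuration (the optional connect-timeout value is shown only for illustration):

<!-- Fence the old active NameNode by logging in over SSH and killing its process -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<!-- Private key on the local machine that the ZKFC uses for that SSH login;
     it must point at the file that really exists (/root/.ssh/id_rsa here,
     not /home/root/.ssh/id_rsa) -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>
<!-- Optional: how long (ms) the SSH fencing attempt may take before it is considered failed -->
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>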
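After correcting the path and restarting the ZKFC processes on both NameNode hosts, killing the active NameNode again should let the standby take over. The state of each NameNode can be checked with hdfs haadmin -getServiceState <serviceId> (the service IDs nn1 and nn2 used here are assumptions matching my configuration), which prints active or standby.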