This guide provides an overview of YARN ResourceManager High Availability (RM HA), and explains how to configure and use this feature.
ResourceManager High Availability adds redundancy so that the cluster can continue operating through:
- Unplanned events such as machine crashes
- Planned maintenance events such as software or hardware upgrades on the machine running the ResourceManager
Architecture
ResourceManager HA is implemented by means of an Active/Standby pair of ResourceManagers. On start-up, each RM is in the Standby state: the process is started, but the state is not loaded. When transitioning to Active, the RM loads the internal state from the designated state store and starts all the internal services. The stimulus to transition to Active comes either from the administrator (through the CLI) or from the integrated failover controller when automatic failover is enabled. The subsections that follow provide more details about the components of RM HA.
RM Restart
RM Restart allows restarting the RM, while recovering the in-flight applications if recovery is enabled. To achieve this, the RM stores its internal state, primarily application-related data and tokens, to the RMStateStore; the cluster resources are re-constructed when the NodeManagers connect. The available alternatives for the state store are MemoryRMStateStore (a memory-based implementation), FileSystemRMStateStore (file system-based implementation; HDFS can be used for the file system), and ZKRMStateStore (ZooKeeper-based implementation).
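As a minimal sketch, enabling RM Restart with the file system-based store looks like the following in yarn-site.xml. The state-store URI shown here is an assumed example, not a value mandated by this guide:

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <!-- Assumed example location; any FileSystem URI (for example, on HDFS) can be used -->
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs:///yarn/system/rmstore</value>
</property>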
Fencing
When running two RMs, a split-brain situation can arise in which both RMs assume they are Active. To avoid this, only a single RM should be able to perform Active operations, and the other RM should be "fenced". The ZooKeeper-based state store (ZKRMStateStore) allows only a single RM to make changes to the stored state, implicitly fencing the other RM. This is accomplished by the RM claiming exclusive create-delete permissions on the root znode. The ACLs on the root znode are automatically created based on the ACLs configured for the store; on secure clusters, Cloudera recommends that you set ACLs for the root node such that both RMs share read-write-admin access, but each has exclusive create-delete access. The fencing is implicit and doesn't require explicit configuration (as fencing in HDFS and MRv1 does). You can plug in a custom "Fencer" if you choose to; for example, to use a different implementation of the state store.
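For illustration only: on a secure cluster you could give each ResourceManager its own identity in the ACL, so that both share read-write-admin while each keeps create-delete to itself. The digest identities and hashes below are placeholders, not values from this guide; the entry for the RM identified as rm1 might look like this (rm2's configuration would swap which identity holds the create-delete permissions):

<property>
  <name>yarn.resourcemanager.zk-state-store.root-node.acl</name>
  <!-- Placeholder ACLs: rm1 holds create-delete (c,d) plus read-write-admin;
       rm2 shares read-write-admin (r,w,a) only -->
  <value>digest:rm1:PLACEHOLDER_HASH_1:cdrwa,digest:rm2:PLACEHOLDER_HASH_2:rwa</value>
</property>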
Configuration and FailoverProxy
In an HA setting, you should configure the two RMs to use different ports (for example, by running them on different hosts). To facilitate this, YARN uses the notion of an RM Identifier (rm-id). Each RM has a unique rm-id, and all the RPC configurations (<rpc-address>; for example, yarn.resourcemanager.address) for that RM are configured via <rpc-address>.<rm-id>. Clients, ApplicationMasters, and NodeManagers use these RPC addresses to talk to the Active RM automatically, even after a failover; to achieve this, they cycle through the list of RMs in the configuration. This is done automatically and doesn't require any extra configuration (as it does in HDFS and MRv1).
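For illustration, assume the rm-ids rm1 and rm2 running on hosts host1 and host2 (the hostnames and port below are placeholders; 8032 is the usual default for yarn.resourcemanager.address). The Client-RM RPC address is then set once per RM:

<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <!-- yarn.resourcemanager.address for the RM identified as rm1 -->
  <name>yarn.resourcemanager.address.rm1</name>
  <value>host1:8032</value>
</property>
<property>
  <!-- yarn.resourcemanager.address for the RM identified as rm2 -->
  <name>yarn.resourcemanager.address.rm2</name>
  <value>host2:8032</value>
</property>

The same pattern applies to the scheduler, admin, resource-tracker, and webapp addresses listed in the configuration table below.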
Manual transitions and failover
You can use the command-line tool yarn rmadmin to transition a particular RM to Active or Standby state, to fail over from one RM to the other, to get the HA state of an RM, and to monitor an RM's health.
Automatic failover
By default, RM HA uses ZKFC (ZooKeeper-based failover controller) for automatic failover in case the Active RM is unreachable or goes down. Internally, the ActiveStandbyElector is used to elect the Active RM. The failover controller runs as part of the RM (not as a separate process as in HDFS and MapReduce v1) and requires no further setup after the appropriate properties are configured in yarn-site.xml.
You can plug in a custom failover controller if you prefer.
Setting up ResourceManager HA
- Stop all YARN daemons
- Update the configuration used by the ResourceManagers, NodeManagers, and clients
- Start all YARN daemons
Step 1: Stop the YARN daemons
To stop the YARN daemons:
$ sudo service hadoop-mapreduce-historyserver stop
$ sudo service hadoop-yarn-resourcemanager stop
$ sudo service hadoop-yarn-nodemanager stop
Step 2: Configure Manual Failover, and Optionally Automatic Failover
Configure the following properties in yarn-site.xml whether you are setting up manual or automatic failover. These properties are sufficient for manual failover; automatic failover requires the additional properties described later in this step.
| Property | Configure on | Default value | Recommended value | Description |
|---|---|---|---|---|
| yarn.resourcemanager.ha.enabled | ResourceManager, NodeManager, Client | false | true | Enable HA. |
| yarn.resourcemanager.ha.rm-ids | ResourceManager, NodeManager, Client | (None) | Cluster-specific, e.g., rm1,rm2 | Comma-separated list of ResourceManager ids in this cluster. |
| yarn.resourcemanager.ha.id | ResourceManager | (None) | RM-specific, e.g., rm1 | Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value. |
| yarn.resourcemanager.address.<rm-id> | ResourceManager, Client | (None) | Cluster-specific | The value of yarn.resourcemanager.address (Client-RM RPC) for this RM. Must be set for all RMs. |
| yarn.resourcemanager.scheduler.address.<rm-id> | ResourceManager, Client | (None) | Cluster-specific | The value of yarn.resourcemanager.scheduler.address (AM-RM RPC) for this RM. Must be set for all RMs. |
| yarn.resourcemanager.admin.address.<rm-id> | ResourceManager, Client/Admin | (None) | Cluster-specific | The value of yarn.resourcemanager.admin.address (RM administration) for this RM. Must be set for all RMs. |
| yarn.resourcemanager.resource-tracker.address.<rm-id> | ResourceManager, NodeManager | (None) | Cluster-specific | The value of yarn.resourcemanager.resource-tracker.address (NM-RM RPC) for this RM. Must be set for all RMs. |
| yarn.resourcemanager.webapp.address.<rm-id> | ResourceManager, Client | (None) | Cluster-specific | The value of yarn.resourcemanager.webapp.address (RM webapp) for this RM. Must be set for all RMs. |
| yarn.resourcemanager.recovery.enabled | ResourceManager | false | true | Enable job recovery on RM restart or failover. |
| yarn.resourcemanager.store.class | ResourceManager | org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore | org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore | The RMStateStore implementation to use to store the ResourceManager's internal state. The ZooKeeper-based store supports fencing implicitly; that is, it allows only a single ResourceManager to make changes to the stored state at a time, and hence is recommended. |
| yarn.resourcemanager.zk-address | ResourceManager | (None) | Cluster-specific | The ZooKeeper quorum to use to store the ResourceManager's internal state. |
| yarn.resourcemanager.zk-acl | ResourceManager | world:anyone:rwcda | Cluster-specific | The ACLs the ResourceManager uses for the znode structure in which it stores the internal state. |
| yarn.resourcemanager.zk-state-store.root-node.acl | ResourceManager | (None) | Cluster-specific | The ACLs used for the root node of the ZooKeeper state store. The ACLs set here should allow both ResourceManagers to read, write, and administer, with exclusive access to create and delete. If nothing is specified, the root node ACLs are automatically generated from the ACLs specified through yarn.resourcemanager.zk-acl, but that leaves a security hole in a secure setup. |
To configure automatic failover, configure the following additional properties in yarn-site.xml:
| Property | Configure on | Default value | Recommended value | Description |
|---|---|---|---|---|
| yarn.resourcemanager.ha.automatic-failover.enabled | ResourceManager | true | true | Enable automatic failover. |
| yarn.resourcemanager.ha.automatic-failover.embedded | ResourceManager | true | true | Use the EmbeddedElectorService to pick an Active RM from the ensemble. |
| yarn.resourcemanager.cluster-id | ResourceManager | (None) | Cluster-specific | Cluster name used by the ActiveStandbyElector to elect one of the ResourceManagers as leader. |
The following is a sample yarn-site.xml showing these properties configured:
<configuration>
  <!-- Resource Manager Configs -->
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>pseudo-yarn-rm-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk.state-store.address</name>
    <value>localhost:2181</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
  </property>

  <!-- RM1 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>host1:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>host1:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>host1:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>host1:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>host1:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>host1:23141</value>
  </property>

  <!-- RM2 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>host2:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>host2:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>host2:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>host2:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>host2:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>host2:23141</value>
  </property>

  <!-- Node Manager Configs -->
  <property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
  </property>
  <property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
  </property>
</configuration>
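The sample above is written for the ResourceManager identified as rm1 (yarn.resourcemanager.ha.id is rm1). On the host running the other ResourceManager, the same file can be reused; a minimal sketch of the only difference:

<property>
  <!-- On the second ResourceManager host, identify this RM as rm2 -->
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm2</value>
</property>

NodeManagers and clients do not set yarn.resourcemanager.ha.id at all; as the table above indicates, it is a ResourceManager-only property.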
Step 3: Restart the YARN daemons
To restart the YARN daemons:
$ sudo service hadoop-mapreduce-historyserver start
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
Using yarn rmadmin to Administer ResourceManager HA
yarn rmadmin has the following options related to RM HA:
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
where <serviceId> is the rm-id.
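For example, with the rm-ids rm1 and rm2 from the sample configuration above, you could query each RM's HA state and then fail over manually by demoting one RM and promoting the other (note that when automatic failover is enabled, the ResourceManager may refuse manual transitions):

$ yarn rmadmin -getServiceState rm1
$ yarn rmadmin -getServiceState rm2
$ yarn rmadmin -transitionToStandby rm1
$ yarn rmadmin -transitionToActive rm2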
Even though -help lists the -failover option, it is not supported by yarn rmadmin.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-High-Availability-Guide/cdh5hag_rm_ha_config.html