Hadoop 2.7.2 YARN Chinese Documentation - ResourceManager High Availability

Introduction
This guide gives an overview of High Availability (HA) for YARN's ResourceManager and explains how to configure and use the feature. The ResourceManager (RM) is responsible for tracking the resources in a cluster and scheduling applications (for example, MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The HA feature removes this single point of failure by adding redundancy in the form of an Active/Standby ResourceManager pair.
Architecture


 
RM Failover
ResourceManager HA is realized through an Active/Standby architecture: at any point in time, one RM is Active and one or more RMs are in Standby mode, waiting to take over if anything happens to the Active. The trigger to transition to Active comes either from the administrator (through the CLI) or from the integrated failover controller when automatic failover is enabled.
Manual transitions and failover
When automatic failover is not enabled, the administrator has to manually transition one of the RMs to Active. To fail over from one RM to another, first transition the Active RM to Standby, then transition a Standby RM to Active. Both steps can be done with the "yarn rmadmin" CLI.
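Manual transitions only apply when automatic failover is switched off. A minimal yarn-site.xml sketch, assuming the property name from the configuration list later in this guide and an explicit false value chosen for illustration:

```xml
<!-- Sketch only: turn off automatic failover so that an administrator
     drives Active/Standby transitions by hand with yarn rmadmin. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>false</value>
</property>
```

Setting the value explicitly matters because, per the configuration list, automatic failover is on by default once HA is enabled.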
 
Automatic failover
The RMs have an option to embed a ZooKeeper-based ActiveStandbyElector that decides which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected as the Active and takes over. Note that there is no need to run a separate ZKFC daemon as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs acts as both failure detector and leader elector.
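As a rough illustration of the embedded election described above, the following yarn-site.xml fragment enables automatic failover with the embedded elector and points it at a ZooKeeper quorum. The property names are taken from the configuration list later in this guide; the hosts zk1, zk2 and zk3 are placeholders, and the explicit true values are shown only for clarity since they are the defaults when HA is enabled.

```xml
<!-- Sketch only: automatic failover using the elector embedded in the RM.
     zk1..zk3 are placeholder ZooKeeper hosts, not real machines. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```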
Client, ApplicationMaster and NodeManager on RM failover
When there are multiple RMs, the configuration file (yarn-site.xml) used by clients and nodes is expected to list all of them. Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try to connect to the RMs in a round-robin fashion until they hit the Active RM. If the Active goes down, they resume round-robin polling until they hit the new Active. The default retry logic is implemented in org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override this logic by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting yarn.client.failover-proxy-provider to the name of your class.
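To make the client-side setting concrete, here is a small yarn-site.xml sketch that names the failover proxy provider explicitly. It uses the default class mentioned above, so it should not change behavior; a custom implementation would put its own fully qualified class name in the value instead.

```xml
<!-- Sketch only: spell out the default failover proxy provider.
     Replace the value with a custom class name to change the retry logic. -->
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
```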
 
Recovering the previous Active RM's state
With ResourceManager Restart enabled, the RM being promoted to Active loads the internal state left by the previous Active and, relying on the RM restart feature, continues from where the previous Active left off as far as possible. A new attempt is spawned for each application previously submitted to the RM. Applications can checkpoint periodically to avoid losing work. The state-store must be visible to both the Active and Standby RMs. There are currently two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore.
ZKRMStateStore implicitly allows write access to only a single RM at any point in time, and is therefore the recommended store for an HA cluster. When using ZKRMStateStore, no separate fencing mechanism is needed to handle a potential split-brain situation in which several RMs assume the Active role. When using ZKRMStateStore, it is also advisable NOT to set the "zookeeper.DigestAuthenticationProvider.superDigest" property on the ZooKeeper cluster, so that the ZooKeeper administrator has no access to YARN application/user credential information.
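A hedged sketch of what selecting the ZooKeeper-backed store might look like in yarn-site.xml follows. The properties yarn.resourcemanager.recovery.enabled and yarn.resourcemanager.store.class are not listed in this guide; they belong to the RM restart feature, so check them against yarn-default.xml for your Hadoop version. The ZooKeeper hosts are placeholders.

```xml
<!-- Sketch only: enable RM state recovery and keep the state in ZooKeeper.
     Property names come from the RM restart feature, not from this guide;
     zk1..zk3 are placeholder hosts. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```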
 
Deployment
Configuration
Most of the failover functionality can be tuned through configuration properties. The required and most important ones are listed below.
Configuration Property: Description
yarn.resourcemanager.zk-address: Address of the ZooKeeper quorum. Used both for the state-store and for embedded leader election.
yarn.resourcemanager.ha.enabled: Enable RM HA.
yarn.resourcemanager.ha.rm-ids: List of logical IDs for the RMs, for example "rm1,rm2".
yarn.resourcemanager.hostname.rm-id: For each rm-id, the hostname the RM corresponds to. Alternatively, each of the RM's service addresses can be set individually.
yarn.resourcemanager.address.rm-id: For each rm-id, the host:port to which clients submit jobs. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.scheduler.address.rm-id: For each rm-id, the scheduler host:port from which ApplicationMasters obtain resources. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.resource-tracker.address.rm-id: For each rm-id, the host:port to which NodeManagers connect. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.admin.address.rm-id: For each rm-id, the host:port for administrative commands. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.webapp.address.rm-id: For each rm-id, the host:port of the RM web application. Not needed if yarn.http.policy is set to HTTPS_ONLY. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.webapp.https.address.rm-id: For each rm-id, the host:port of the RM HTTPS web application. Not needed if yarn.http.policy is set to HTTP_ONLY. If set, overrides the hostname set in yarn.resourcemanager.hostname.rm-id.
yarn.resourcemanager.ha.id: Identifies this RM within the ensemble. Optional; if set, however, administrators must make sure that every RM has its own unique ID in the configuration.
yarn.resourcemanager.ha.automatic-failover.enabled: Enable automatic failover. By default, it is enabled only when HA is enabled.
yarn.resourcemanager.ha.automatic-failover.embedded: When automatic failover is enabled, use the embedded leader elector to pick the Active RM. By default, it is enabled only when HA is enabled.
yarn.resourcemanager.cluster-id: Identifies the cluster. The elector uses it to make sure an RM does not take over as Active for another cluster.
yarn.client.failover-proxy-provider: The class used by Clients, AMs and NMs to fail over to the Active RM.
yarn.client.failover-max-attempts: The maximum number of times the FailoverProxyProvider attempts failover.
yarn.client.failover-sleep-base-ms: The sleep base (in milliseconds) used to calculate the exponential delay between failovers.
yarn.client.failover-sleep-max-ms: The maximum sleep time (in milliseconds) between failovers.
yarn.client.failover-retries: The number of retries per attempt to connect to a ResourceManager.
yarn.client.failover-retries-on-socket-timeouts: The number of retries per attempt to connect to a ResourceManager on socket timeouts.
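The client-side failover properties at the end of the list could be combined as in the sketch below. The property names are the ones listed above; the numeric values are arbitrary illustrations, not recommendations or defaults.

```xml
<!-- Sketch only: bound how long clients, AMs and NMs keep trying to reach
     the Active RM. All three values are illustrative assumptions. -->
<property>
  <name>yarn.client.failover-max-attempts</name>
  <value>15</value>
</property>
<property>
  <name>yarn.client.failover-sleep-base-ms</name>
  <value>500</value>
</property>
<property>
  <name>yarn.client.failover-sleep-max-ms</name>
  <value>15000</value>
</property>
```

With settings of this shape, the delay between failover attempts grows from the base value and is capped at the maximum, so a short RM outage is retried quickly while a long one does not produce a tight retry loop.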
 
Sample configuration
Here is a minimal sample configuration for RM failover.
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>master1:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>master2:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
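The sample relies on each RM working out its own identity. If you prefer to state it explicitly, the optional yarn.resourcemanager.ha.id property from the list above could be added on each RM host. A sketch for the host master1, assuming it is the machine behind rm1:

```xml
<!-- Sketch only: goes into yarn-site.xml on master1 alone.
     master2 would carry the same property with the value rm2. -->
<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm1</value>
</property>
```

Because the value differs per host, this fragment cannot be part of a yarn-site.xml that is copied unchanged to every machine.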
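Admin commands
yarn rmadmin has a few HA-specific command options for checking the health or state of an RM and for transitioning it between Active and Standby. The HA commands take the service ID of the RM (as set in yarn.resourcemanager.ha.rm-ids) as their argument.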
$ yarn rmadmin -getServiceState rm1
active

$ yarn rmadmin -getServiceState rm2
standby
If automatic failover is enabled, you cannot use the manual transition commands. You can override this with the -forcemanual flag, but do so with caution.
$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@1d8299fd
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the forcemanual flag.
ResourceManager Web UI services
Assuming a Standby RM is up and running, the Standby automatically redirects all web requests to the Active, except for the "About" page.
Web Services
Assuming a Standby RM is up and running, RM web-service requests sent to the Standby are automatically redirected to the Active RM.
 
