1. HDFS High Availability
1. Basic description
HDFS high availability keeps the cluster serving requests even when a single node (or a small number of nodes) fails. The mechanism runs two NameNodes in an Active/Standby pair, giving the cluster a hot-standby NameNode. If the Active NameNode fails, service can quickly switch over to the Standby node, eliminating the NameNode as a single point of failure.
2. Detailed mechanism
- High availability is built on two NameNodes and relies on a shared edits log (JournalNodes) and a ZooKeeper cluster;
- Each NameNode host runs a ZKFailoverController (ZKFC) process that monitors the health of its NameNode;
- Each ZKFC maintains a persistent session with the ZooKeeper cluster;
- If the Active NameNode fails, its ZooKeeper session lapses and ZooKeeper notifies the Standby side;
- The Standby's ZKFC detects and confirms that the failed node is really out of service (fencing);
- The ZKFC then transitions the Standby NameNode to the Active state so the service continues.
ZooKeeper plays a central role in big-data systems: it coordinates the work of different components and maintains and distributes small amounts of shared state. The automatic failover described above, for example, depends entirely on the ZooKeeper component.
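The election step can be illustrated with a toy sketch: ZKFC failover boils down to "the first controller to grab an exclusive lock makes its NameNode Active". The simulation below uses a local lock directory in place of a real ZooKeeper ephemeral znode; the names and the lock path are illustrative only.

```shell
# Toy simulation of ZKFC leader election: the first caller to create the
# exclusive lock becomes Active, later callers stay Standby. A real ZKFC
# uses an ephemeral znode in ZooKeeper instead of a local directory.
LOCK=/tmp/mycluster-active.lock
rm -rf "$LOCK"

elect() {
  # mkdir is atomic: it succeeds for exactly one caller.
  if mkdir "$LOCK" 2>/dev/null; then
    echo "$1 -> active"
  else
    echo "$1 -> standby"
  fi
}

elect nn1   # prints "nn1 -> active"
elect nn2   # lock already held, prints "nn2 -> standby"
rm -rf "$LOCK"
```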
2. HDFS High Availability Configuration
1. Overall configuration
Host | HDFS | YARN | Single-instance service | Shared edits | ZooKeeper |
---|---|---|---|---|---|
hop01 | DataNode | NodeManager | NameNode | JournalNode | ZK-hop01 |
hop02 | DataNode | NodeManager | ResourceManager | JournalNode | ZK-hop02 |
hop03 | DataNode | NodeManager | SecondaryNameNode | JournalNode | ZK-hop03 |
2. Configure JournalNode
Create a directory
[root@hop01 opt]# mkdir hopHA
Copy Hadoop directory
cp -r /opt/hadoop2.7/ /opt/hopHA/
Configure core-site.xml
<configuration>
<!-- NameNode cluster (nameservice) mode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- Storage directory for files generated at Hadoop runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hopHA/hadoop2.7/data/tmp</value>
</property>
</configuration>
Configure hdfs-site.xml, adding the following content
<!-- Name of the distributed cluster (nameservice) -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!-- NameNode nodes in the cluster -->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- NN1 RPC address -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hop01:9000</value>
</property>
<!-- NN2 RPC address -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hop02:9000</value>
</property>
<!-- NN1 HTTP address -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hop01:50070</value>
</property>
<!-- NN2 HTTP address -->
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hop02:50070</value>
</property>
<!-- Location of the NameNode metadata (shared edits) on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hop01:8485;hop02:8485;hop03:8485/mycluster</value>
</property>
<!-- Fencing method: only one NameNode may serve clients at a time -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- sshfence requires passwordless SSH login -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Storage directory on the JournalNode servers -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/hopHA/hadoop2.7/data/jn</value>
</property>
<!-- Disable permission checking -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<!-- Proxy provider that lets clients fail over automatically -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
Start the JournalNode service on each node in turn
[root@hop01 hadoop2.7]# pwd
/opt/hopHA/hadoop2.7
[root@hop01 hadoop2.7]# sbin/hadoop-daemon.sh start journalnode
Delete data under hopHA
[root@hop01 hadoop2.7]# rm -rf data/ logs/
Format and start the NameNode on NN1 (hop01)
[root@hop01 hadoop2.7]# pwd
/opt/hopHA/hadoop2.7
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
Synchronize NN1's metadata to NN2
[root@hop02 hadoop2.7]# bin/hdfs namenode -bootstrapStandby
NN2 starts NameNode
[root@hop02 hadoop2.7]# sbin/hadoop-daemon.sh start namenode
View the current status (for example with jps)
Start all DataNodes on NN1
[root@hop01 hadoop2.7]# sbin/hadoop-daemons.sh start datanode
Switch NN1 to the Active state
[root@hop01 hadoop2.7]# bin/hdfs haadmin -transitionToActive nn1
[root@hop01 hadoop2.7]# bin/hdfs haadmin -getServiceState nn1
active
3. Failover configuration
Configure hdfs-site.xml with the following additions, then synchronize it across the cluster
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
Configure core-site.xml with the following additions, then synchronize it across the cluster
<property>
<name>ha.zookeeper.quorum</name>
<value>hop01:2181,hop02:2181,hop03:2181</value>
</property>
Close all HDFS services
[root@hop01 hadoop2.7]# sbin/stop-dfs.sh
Start the ZooKeeper cluster
/opt/zookeeper3.4/bin/zkServer.sh start
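Before formatting the ZKFC state it is worth confirming that every ZooKeeper server in the quorum is up. A hedged sketch using ZooKeeper's standard `ruok` four-letter command (hostnames and port 2181 are the ones configured above; `nc` availability is assumed):

```shell
# Ask each ZooKeeper server in the quorum whether it is healthy.
# A healthy server answers the standard 'ruok' command with 'imok'.
for zk in hop01 hop02 hop03; do
  if reply=$(echo ruok | nc -w 2 "$zk" 2181 2>/dev/null) && [ "$reply" = "imok" ]; then
    echo "$zk: ok"
  else
    echo "$zk: no response (expected when run outside the cluster)"
  fi
done
```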
Initialize the HA state in ZooKeeper from hop01
[root@hop01 hadoop2.7]# bin/hdfs zkfc -formatZK
hop01 starts the HDFS service
[root@hop01 hadoop2.7]# sbin/start-dfs.sh
Start ZKFC on each NameNode node
Whichever NameNode's ZKFC starts first becomes Active; here hop02 is started first, so hop02 becomes the Active node.
[hadoop2.7]# sbin/hadoop-daemon.sh start zkfc
Kill the NameNode process on hop02 (PID 14422 in this run)
kill -9 14422
Wait a moment, then check the status of hop01
[root@hop01 hadoop2.7]# bin/hdfs haadmin -getServiceState nn1
active
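The manual check above can be scripted. A hedged sketch, run from /opt/hopHA/hadoop2.7 on hop01, using the standard `hdfs haadmin -getServiceState` command; the fallback lets it degrade gracefully where Hadoop is not installed.

```shell
# After killing the Active NameNode on hop02, confirm nn1 took over.
# Falls back to 'unreachable' when the hdfs command is not available.
state=$(bin/hdfs haadmin -getServiceState nn1 2>/dev/null || echo unreachable)
if [ "$state" = "active" ]; then
  echo "failover succeeded: nn1 is now active"
else
  echo "nn1 state: $state"
fi
```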
3. YARN High Availability
1. Basic description
The process and idea are similar to the HDFS mechanism and likewise rely on the ZooKeeper cluster: when the Active node fails, the Standby node switches to the Active state to keep the service running.
2. Detailed configuration
The environment is also demonstrated based on hop01 and hop02.
Configure yarn-site.xml and synchronize it across the cluster
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Enable the HA mechanism -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Declare the ResourceManager cluster -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster-yarn01</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hop01</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hop02</value>
</property>
<!-- Address of the ZooKeeper cluster -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hop01:2181,hop02:2181,hop03:2181</value>
</property>
<!-- Enable automatic recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- Store the ResourceManager state in the ZooKeeper cluster -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
</configuration>
Restart the JournalNode service
sbin/hadoop-daemon.sh start journalnode
Format and start the NN1 service
[root@hop01 hadoop2.7]# bin/hdfs namenode -format
[root@hop01 hadoop2.7]# sbin/hadoop-daemon.sh start namenode
Synchronize NN1 metadata on NN2
[root@hop02 hadoop2.7]# bin/hdfs namenode -bootstrapStandby
Start the DataNodes across the cluster
[root@hop01 hadoop2.7]# sbin/hadoop-daemons.sh start datanode
Set NN1 to the Active state: start ZKFC on hop01 first, then on hop02.
[root@hop01 hadoop2.7]# sbin/hadoop-daemon.sh start zkfc
Start YARN on hop01
[root@hop01 hadoop2.7]# sbin/start-yarn.sh
Start the ResourceManager on hop02
[root@hop02 hadoop2.7]# sbin/yarn-daemon.sh start resourcemanager
Check status
[root@hop01 hadoop2.7]# bin/yarn rmadmin -getServiceState rm1
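Both ResourceManagers can be queried in one loop. A sketch using the same standard `yarn rmadmin -getServiceState` command; rm1/rm2 are the ids declared in yarn-site.xml above, and the fallback lets the loop degrade gracefully outside the cluster.

```shell
# Query the HA state of both ResourceManagers (rm1 on hop01, rm2 on hop02).
# Falls back to 'unreachable' when the yarn command is not available.
for rm in rm1 rm2; do
  state=$(bin/yarn rmadmin -getServiceState "$rm" 2>/dev/null || echo unreachable)
  echo "$rm: $state"
done
```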
4. Source Code Address
GitHub address
https://github.com/cicadasmile/big-data-parent
GitEE address
https://gitee.com/cicadasmile/big-data-parent