Table of contents
1. Background knowledge of high availability (HA)
1.2 How to solve a single point of failure
1.2.1 Active-standby clusters
1.2.4 Judging criteria for cluster availability (x nines)
1.3 Core issues of HA system design
1.3.2 The data state synchronization problem
2. NameNode single point of failure
3.1 QJM: active/standby switchover and solving the split-brain problem
3.1.1 ZKFailoverController (zkfc)
3.1.2 Fencing (isolation) mechanism
3.2 Solving the active/standby data state synchronization problem
4. HDFS HA cluster construction
4.2 Basic cluster environment preparation
4.3 Modify the Hadoop configuration files
4.4 Synchronize the configuration files across the cluster
5. HDFS HA cluster demonstration
5.1 View the status of the two NameNodes on the web page
5.2 Normal operation under the HA cluster
5.3.1 HA automatic switchover failed: error resolution
1. Background knowledge of high availability ( HA )
1.1 Single point of failure
A single point of failure (SPOF) is a component whose failure stops the entire system from operating. In other words, a failure at that single point brings the whole system down.
1.2 How to solve single point of failure
1.2.1 Active and standby clusters
The core of eliminating single points of failure and achieving high service availability is not preventing failures altogether (software and hardware failures are unavoidable) but minimizing their impact on the business.
The mature practice in enterprises is to set up a backup for each single point of failure, forming an active-standby architecture. Put simply: when the master goes down, the backup takes over and continues to provide service after a short interruption.
A common architecture is one active and one standby, though one active with multiple standbys is also possible. More standbys mean stronger fault tolerance, but also greater redundancy and more wasted resources.
1.2.2 Active and Standby
- Active: the primary role. The active role is the one providing service to the outside world; at any moment there is only one active instance serving requests.
- Standby: the backup role. It keeps its data and state synchronized with the active role and stands ready to switch to active (when the active hangs or fails) and provide service, maintaining availability.
1.2.3 High availability
High availability (HA) is an IT term referring to a system's ability to perform its functions without interruption; it represents the degree to which the system is available and is one of the criteria for system design. A highly available system keeps its services running longer, usually by increasing the system's fault tolerance.
High-availability (high-reliability) systems are designed so that a single point of failure does not bring down the whole system. The usual approach is to add redundant components with the same function: as long as these components do not all fail at the same time, the system (or at least part of it) keeps operating, which improves reliability.
1.2.4 Judging criteria for cluster availability (x nines)
System reliability is commonly measured in "x nines", where x is usually a number from 3 to 5. x nines is the ratio of the time the system operates normally to the total time over one year of use.
- Three nines: (1 - 99.9%) * 365 * 24 = 8.76 hours, meaning the system may be down for at most 8.76 hours per year of continuous operation.
- Four nines: (1 - 99.99%) * 365 * 24 = 0.876 hours = 52.6 minutes, meaning at most 52.6 minutes of downtime per year.
- Five nines: (1 - 99.999%) * 365 * 24 * 60 = 5.26 minutes, meaning at most 5.26 minutes of downtime per year.
The more nines, the more reliable the system and the less downtime it tolerates, but the higher the cost.
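The arithmetic above can be checked with a small sketch; `downtime_minutes` is a hypothetical helper, not part of any Hadoop tooling:

```shell
# Downtime per year implied by N nines of availability:
# (1 - availability) * minutes in a year
downtime_minutes() {  # arg: number of nines
  awk -v n="$1" 'BEGIN { printf "%.2f", (10^(-n)) * 365 * 24 * 60 }'
}

for n in 3 4 5; do
  echo "$n nines: $(downtime_minutes "$n") minutes of downtime per year"
done
```

Running it reproduces the figures above: 525.60 minutes (8.76 hours) for three nines, 52.56 minutes for four, 5.26 minutes for five.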
1.3 Core issues of HA system design
1.3.1 Split brain problem
Split-brain is originally a medical term. In an HA cluster, split-brain occurs when the "heartbeat line" between the active and standby nodes breaks (that is, the two nodes lose contact with each other): the HA system, originally a coordinated whole, splits into two independent nodes. Having lost contact, each behaves like one half of a "split-brain patient", and the whole cluster falls into chaos.
Serious consequences of split-brain:
- No master in the cluster: each node assumes the other is healthy and stays in the standby role, so no one provides service.
- Multiple masters in the cluster: each node assumes the other has failed and takes the active role. Competing for shared resources leads to system confusion and data corruption, and clients no longer know which node to contact.
The key to avoiding split-brain is to ensure that at any moment the system has one and only one active role providing service.
1.3.2 Data state synchronization problem
The prerequisite for an active-standby switchover to keep the service continuously available is that the state and data of the active and standby nodes are consistent, or nearly so. If the gap between the standby's data and the active's is too large, completing the switchover is meaningless.
A common approach to data synchronization is replaying operation records from a log: the active role serves requests normally and records every transactional operation to a log, while the standby role reads the log and replays the operations.
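The idea of log-based replay can be sketched with a toy example. This is only an illustration of the principle, not how HDFS itself replays edits:

```shell
# Toy sketch of log-based state synchronization:
# the "active" appends each operation to a shared log,
# the "standby" rebuilds its state by replaying the log in order.
LOG=$(mktemp)

# active role records transactional operations
printf 'mkdir /a\nmkdir /a/b\n' >> "$LOG"

# standby role replays each record in order
replayed=""
while read -r line; do
  replayed="$replayed[$line]"
done < "$LOG"
echo "standby state: $replayed"

rm -f "$LOG"
```

The standby ends up with the same sequence of operations applied, which is exactly the guarantee a shared edits log provides.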
2. NameNode single point of failure
2.1 Overview
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had only one NameNode, and if its process became unavailable, the entire HDFS cluster was unavailable.
The single point of failure of the NameNode affects the overall availability of the HDFS cluster in two ways:
- In the event of an unexpected event such as a machine crash, the cluster will be unavailable until the NameNode is restarted.
- Planned maintenance events, such as software or hardware upgrades on the NameNode machines, will cause extended cluster downtime.
2.2 Solution
Run two redundant NameNodes in the same cluster (since Hadoop 3.0.0, more than two are supported) in an active-standby architecture. This allows a fast failover to a new NameNode when a machine crashes, or a graceful administrator-initiated failover for planned maintenance.
3. HDFS HA solution -- QJM
QJM stands for Quorum Journal Manager, one of the HDFS HA solutions officially recommended by Hadoop.
It uses ZKFC (built on ZooKeeper) to implement automatic active/standby switchover, and a JournalNode (JN) cluster to share the edits log and keep the data synchronized.
3.1 QJM: active/standby switchover and solving the split-brain problem
3.1.1 ZKFailoverController(zkfc)
Apache ZooKeeper is a highly available distributed coordination service used to maintain small amounts of coordination data. The following ZooKeeper features participate in the HDFS HA solution:
- Ephemeral znodes
If a znode is ephemeral, its lifetime is tied to the session of the client that created it. When the client disconnects and the session ends, the znode is deleted automatically.
- Path uniqueness
ZooKeeper maintains a directory-tree-like data structure whose nodes are called znodes. Znode paths are unique; duplicate names cannot exist. This can also be understood as exclusivity.
- Watch mechanism
A client can watch for events on a znode; when an event fires, the ZooKeeper service notifies the client.
ZKFailoverController (ZKFC) is a new component that acts as a ZooKeeper client. Every machine that runs a NameNode also runs a ZKFC, whose main responsibilities are:
- Monitoring and managing NameNode health
ZKFC checks the health of its local NameNode node and machine with periodic health-check commands.
- Maintaining a session with the ZooKeeper cluster and electing the active
If the local NameNode is healthy and the ZKFC sees that no other node currently holds the lock znode, it tries to acquire the lock itself. If it succeeds, it has "won the election" and runs a failover to make its local NameNode active. If another node already holds the lock and the election fails, the ZKFC registers a watch on the lock and waits for the next election.
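The election just described rests on the lock znode being creatable by exactly one client. A toy analogy, using an atomic `mkdir` as the "lock" (this is an illustration of the mutual-exclusion idea, not ZooKeeper's actual implementation):

```shell
# Toy election: mkdir is atomic, so only the first claimant succeeds,
# analogous to ZKFC instances competing for the exclusive lock znode.
LOCK="${TMPDIR:-/tmp}/ha-demo-lock.$$"

try_become_active() {  # arg: node name
  if mkdir "$LOCK" 2>/dev/null; then
    echo "$1 wins election -> Active"
  else
    echo "$1 lost election -> Standby, watching the lock"
  fi
}

try_become_active nn1
try_become_active nn2
rm -rf "$LOCK"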
3.1.2 Fencing (isolation) mechanism
Failover is the process of switching the active and standby roles. The biggest risk during a switchover is split-brain, so a fencing mechanism is needed to isolate the previous active node before converting the standby to active.
The Hadoop common library provides two fencing implementations, sshfence and shellfence (the default):
- sshfence: log in to the target node via SSH and kill the process with the fuser command (the process PID is located by TCP port number, which is more accurate than the jps command).
- shellfence: run a user-defined shell command (script) to complete the isolation.
3.2 Solving the active/standby data state synchronization problem
The JournalNode (JN) cluster is a lightweight distributed system used mainly for high-speed reads and writes and for storing data. Typically 2N+1 JournalNodes are used to store the shared edits log; the underlying protocol is similar to ZooKeeper's distributed consensus algorithm.
Whenever a modification is performed on the active NameNode, the edits log is recorded to at least a majority of the JNs. The standby NameNode then detects that the log in the JNs has changed, reads the new edits from the JNs, and replays the recorded operations against its own namespace image.
When a fault occurs and the active NameNode goes down, the standby reads all remaining edits from the JNs before becoming active. This guarantees, with high reliability, that its namespace is consistent with the failed NameNode's, so it can take over seamlessly and keep serving client requests, achieving high availability.
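The "2N+1 JournalNodes" sizing follows from simple majority arithmetic, sketched here:

```shell
# With J JournalNodes, a write is durable once a majority acknowledge it,
# so a 2N+1 node cluster tolerates N simultaneous JN failures.
majority()  { echo $(( $1 / 2 + 1 )); }       # smallest majority of J nodes
tolerated() { echo $(( ($1 - 1) / 2 )); }     # failures J can survive

for jns in 3 5 7; do
  echo "$jns JNs: majority $(majority "$jns"), tolerates $(tolerated "$jns") failure(s)"
done
```

This is why an even JN count buys nothing: 4 JNs still require 3 acknowledgements and tolerate only 1 failure, the same as 3 JNs.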
4. HDFS HA cluster construction
4.1 HA cluster planning
| IP | machine | run role |
| --- | --- | --- |
| 192.168.170.136 | hadoop01 | namenode zkfc datanode zookeeper journalnode |
| 192.168.170.137 | hadoop02 | namenode zkfc datanode zookeeper journalnode |
| 192.168.170.138 | hadoop03 | datanode zookeeper journalnode |
4.2 Cluster basic environment preparation
- Modify the Linux hostname: /etc/hostname
- Modify the IP: /etc/sysconfig/network-scripts/ifcfg-ens33
- Modify the hostname-to-IP mapping: /etc/hosts
- Turn off the firewall
- Configure SSH password-free login
- Install the JDK and configure environment variables: /etc/profile
- Synchronize cluster time
- Configure password-free SSH login between the active and standby NameNodes
Refer to this article for the detailed steps: Hadoop 3.2.4 cluster building detailed graphic tutorial_Stars.Sky's blog-CSDN blog
Note: in what follows I modify the Hadoop cluster built in that article and list only the changed parts; everything else stays the same as the original.
4.3 Modify the Hadoop configuration file
4.3.1 hadoop-env.sh
[root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim hadoop-env.sh
# Configure JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.8.0_381
# Set the users that run the shell commands for each role
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
4.3.2 core-site.xml
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim core-site.xml
<configuration>
<!-- HA cluster name; must match the configuration in hdfs-site.xml -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- Hadoop local data storage directory, generated automatically at format time -->
<property>
<name>hadoop.tmp.dir</name>
<value>/bigdata/hadoop/data/tmp</value>
</property>
<!-- Username used when accessing HDFS from the Web UI -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
<!-- Addresses and ports of the ZooKeeper cluster -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
</configuration>
4.3.3 hdfs-site.xml
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim hdfs-site.xml
<configuration>
<!-- Set the HDFS nameservice to mycluster; must match core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!-- mycluster contains two NameNodes: nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop01:8020</value>
</property>
<!-- HTTP address of nn1 -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop01:9870</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop02:8020</value>
</property>
<!-- HTTP address of nn2 -->
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop02:9870</value>
</property>
<!-- Where the NameNode edits metadata is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/mycluster</value>
</property>
<!-- Where each JournalNode stores data on local disk -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/bigdata/hadoop/data/journaldata</value>
</property>
<!-- Enable automatic NameNode failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- The implementation class responsible for performing failover when the cluster fails -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Configure the fencing method -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- sshfence requires passwordless SSH login -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Timeout for the sshfence fencing mechanism -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<!-- Enable short-circuit local reads -->
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<!-- The directory must be created manually: mkdir -p /var/lib/hadoop-hdfs -->
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<!-- Enable the exclude (blacklist) file -->
<property>
<name>dfs.hosts.exclude</name>
<value>/bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/excludes</value>
</property>
</configuration>
4.4 Synchronize the configuration files across the cluster
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hadoop-env.sh root@hadoop02:$PWD
hadoop-env.sh 100% 16KB 6.9MB/s 00:00
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hadoop-env.sh root@hadoop03:$PWD
hadoop-env.sh 100% 16KB 991.1KB/s 00:00
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r core-site.xml root@hadoop02:$PWD
core-site.xml 100% 1404 507.9KB/s 00:00
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r core-site.xml root@hadoop03:$PWD
core-site.xml 100% 1404 386.9KB/s 00:00
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hdfs-site.xml root@hadoop02:$PWD
hdfs-site.xml 100% 3256 1.1MB/s 00:00
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hdfs-site.xml root@hadoop03:$PWD
hdfs-site.xml 100% 3256 2.4MB/s 00:00
4.5 HA cluster initialization
First install the ZooKeeper cluster: [Zookeeper Elementary] 02, Zookeeper Cluster Deployment_Stars.Sky's Blog-CSDN Blog
# 1. Start the zookeeper cluster first
[root@hadoop01 /bigdata/hadoop/zookeeper]# zk.sh start
# 2. Manually start the JN cluster (on all 3 machines)
hdfs --daemon start journalnode
# 3. On hadoop01, format the namenode and start it
[root@hadoop01 ~]# hdfs namenode -format
[root@hadoop01 ~]# hdfs --daemon start namenode
# 4. On hadoop02, bootstrap the namenode metadata from the active
[root@hadoop02 ~]# hdfs namenode -bootstrapStandby
# 5. Format zkfc. Note: the machine this is executed on becomes the first Active NN
[root@hadoop01 ~]# hdfs zkfc -formatZK
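After initialization and startup, the HA state of each NameNode can be queried with `hdfs haadmin -getServiceState`, using the nn1/nn2 ids configured in hdfs-site.xml. A small helper, guarded so it degrades gracefully on a machine without the HDFS client:

```shell
# Sanity check: report each NameNode's HA state (run on a cluster node).
# nn1/nn2 are the dfs.ha.namenodes.mycluster ids from hdfs-site.xml.
check_nn_states() {
  if command -v hdfs >/dev/null 2>&1; then
    for nn in nn1 nn2; do
      echo "$nn: $(hdfs haadmin -getServiceState "$nn")"
    done
  else
    echo "hdfs client not found on this machine"
  fi
}

check_nn_states
```

On a healthy cluster, one NameNode should report `active` and the other `standby`.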
4.6 HA cluster startup
Start the HDFS cluster on hadoop01 :
[root@hadoop01 ~]# start-dfs.sh
[root@hadoop01 ~]# jps
6355 QuorumPeerMain
6516 JournalNode
7573 DataNode
7989 DFSZKFailoverController
8040 Jps
7132 NameNode
[root@hadoop02 ~]# jps
4688 JournalNode
5201 NameNode
5521 Jps
5282 DataNode
4536 QuorumPeerMain
5482 DFSZKFailoverController
[root@hadoop03 ~]# jps
4384 DataNode
3990 QuorumPeerMain
4136 JournalNode
4511 Jps
5. HDFS HA Cluster Demonstration
5.1 View the status of two NameNodes on the Web page
On hadoop01, it shows that the namenode is active:
On hadoop02, it shows that the namenode is in standby state:
5.2 Normal operation under HA cluster
[root@hadoop01 ~]# hadoop fs -mkdir /test02
[root@hadoop01 ~]# hadoop fs -put apache-zookeeper-3.7.1-bin.tar.gz /test02
Operations work normally against the active NameNode, but files cannot be previewed through the standby's web page:
5.3 Simulate fault occurrence
On hadoop01, manually kill the namenode process. The namenode on hadoop02 then switches to the active state and the HDFS service remains available.
[root@hadoop01 ~]# jps
6355 QuorumPeerMain
6516 JournalNode
7573 DataNode
7989 DFSZKFailoverController
8267 Jps
7132 NameNode
[root@hadoop01 ~]# kill -9 7132
5.3.1 HA automatic switchover failed: error resolution
Using kill -9 simulates a JVM crash; power-cycling the machine or unplugging its network interface simulates other kinds of failure. The other NameNode should automatically become active within seconds. The time needed to detect a failure and trigger a failover depends on ha.zookeeper.session-timeout.ms, which defaults to 5 seconds.
If the test fails, check the logs of the zkfc and NameNode daemons to diagnose the problem further. If the error says the fuser program cannot be found, fencing cannot be carried out. fuser is part of the psmisc package, which must be installed on both NameNode machines:
[root@hadoop01 ~]# yum install psmisc -y
[root@hadoop02 ~]# yum install psmisc -y
Finally, restart the HDFS cluster and re-simulate the failure; automatic switchover now works.
Previous article: HDFS cluster dynamic node management_Stars.Sky's Blog-CSDN Blog
Next article: Hadoop YARN HA Cluster Installation and Deployment Detailed Graphical Tutorial_Stars.Sky's Blog-CSDN Blog