HDFS HA (High Availability) Cluster Setup: A Detailed Illustrated Tutorial

Table of contents

1. Background knowledge of high availability (HA)

1.1 Single point of failure

1.2 How to solve a single point of failure

1.2.1 Active and standby clusters 

1.2.2 Active and Standby

1.2.3 High availability

1.2.4 Judging Criteria for Cluster Availability (x 9s)

1.3 Core issues of HA system design 

1.3.1 Split brain problem 

1.3.2 Data state synchronization problem

2. NameNode single point of failure

2.1 Overview 

2.2 Solution

3. HDFS HA solution: QJM

3.1 QJM: active/standby switchover and split-brain prevention

3.1.1 ZKFailoverController (ZKFC)

3.1.2 Fencing (isolation) mechanism 

3.2 Solving active/standby data state synchronization

4. HDFS HA cluster setup

4.1 HA cluster planning 

4.2 Cluster basic environment preparation 

4.3 Modify the Hadoop configuration file 

4.3.1 hadoop-env.sh

4.3.2 core-site.xml

4.3.3 hdfs-site.xml 

4.4 Synchronize the modified configuration files to the other nodes

4.5 HA cluster initialization 

4.6 HA cluster startup 

5. HDFS HA Cluster Demonstration

5.1 View the status of two NameNodes on the Web page 

5.2 Normal operation under HA cluster

5.3 Simulate a failure

5.3.1 HA automatic failover fails: error resolution


1. Background knowledge of high availability (HA)

1.1  Single point of failure

        A single point of failure (SPOF) is a point in the system whose failure stops the entire system from operating. In other words, the failure of that one point brings the whole system down.

1.2  How to solve a single point of failure

1.2.1 Active and standby clusters 

        The key to eliminating a single point of failure and keeping a service highly available is not to prevent failures from ever happening, since software and hardware failures are unavoidable, but to minimize the impact a failure has on the business.

        The mature practice in industry is to provision a backup for the single point of failure, forming an active-standby architecture: when the active node goes down, the standby takes over and, after a short interruption, continues to provide service.

        A common architecture is one active and one standby, although one active with multiple standbys is also possible. More standbys mean stronger fault tolerance, but also more redundancy and more wasted resources.

1.2.2  Active and Standby

  • Active: the primary role. The Active instance is the one currently serving external requests; at any time there is exactly one Active providing service.
  • Standby: the backup role. It keeps its data and state synchronized with the Active and stands ready to take over the Active role (when the Active crashes or fails) so that the service stays available.

1.2.3  High availability

        High availability (abbreviated HA) is an IT term referring to a system's ability to perform its functions without interruption; it expresses the degree to which the system is available and is one of the criteria used in system design. A highly available system keeps its services running for longer, usually by increasing the system's fault tolerance.

        A highly available or highly reliable system must not let a single point of failure bring everything down. The usual approach is to add redundant components with the same function: as long as they do not all fail at the same time, the system (or at least part of it) keeps operating, which improves reliability.

1.2.4  Judging criteria for cluster availability (x 9s)

        System reliability is commonly measured in "X nines", where X is usually a number from 3 to 5. X nines express the ratio of the system's normal uptime to the total time over one year of use.

  • Three 9s: (1-99.9%)*365*24 = 8.76 hours, meaning that over a year of continuous operation the system may be down for at most 8.76 hours.
  • Four 9s: (1-99.99%)*365*24 = 0.876 hours = 52.6 minutes, meaning at most 52.6 minutes of downtime per year of continuous operation.
  • Five 9s: (1-99.999%)*365*24*60 = 5.26 minutes, meaning at most 5.26 minutes of downtime per year of continuous operation.

It can be seen that the more nines, the more reliable the system and the less downtime it can tolerate, but also the higher the cost.
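
As a quick sanity check, these downtime figures can be reproduced with a one-line shell calculation (assuming the bc calculator is installed):

# Maximum yearly downtime for three, four and five nines
echo "(1 - 0.999) * 365 * 24" | bc -l        # ~ 8.76 hours
echo "(1 - 0.9999) * 365 * 24 * 60" | bc -l  # ~ 52.56 minutes
echo "(1 - 0.99999) * 365 * 24 * 60" | bc -l # ~ 5.26 minutes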

1.3  Core issues of HA  system design 

1.3.1 Split brain problem 

        Split-brain is originally a medical term. In an HA cluster, split-brain means that when the "heartbeat" link between the active and standby nodes is broken (that is, the two nodes lose contact with each other), the HA system, which used to act as a coordinated whole, splits into two independent nodes. Having lost contact, the active and standby behave like a "split-brain patient", and the whole cluster falls into chaos.

Serious consequences of split-brain:

  1. No active node in the cluster: each node believes the other is healthy and stays in the standby role, so nobody provides service;

  2. Multiple active nodes in the cluster: each node believes the other has failed and takes the active role. They compete for shared resources, which leads to system confusion and data corruption, and clients no longer know which node to talk to.

The key to avoiding split-brain is to ensure that at any moment the system has one and only one active role providing service.

1.3.2 Data state synchronization problem

        For an active-standby switchover to keep the service continuously available, the state and data of the active and standby nodes must be consistent, or nearly so. If the gap between the standby's data and the active's data is too large, completing the switchover is meaningless.

        A common approach to data synchronization is to replay operation records from a log: the active role serves requests and records every transactional operation in a log, while the standby role reads the log and replays the operations.

2. NameNode single point of failure

2.1 Overview 

        Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had only one NameNode, and if the NameNode process became unavailable, the entire HDFS cluster was unavailable.

The single point of failure of the NameNode affects the overall availability of the HDFS cluster in two ways:

  • In the event of an unexpected event such as a machine crash, the cluster will be unavailable until the NameNode is restarted.
  • Planned maintenance events, such as software or hardware upgrades on the NameNode machines, will cause extended cluster downtime.

2.2 Solution

        Run two redundant NameNodes in the same cluster (since Hadoop 3.0.0, more than two are supported), forming an active-standby architecture. This allows a fast failover to a new NameNode when a machine crashes, or a graceful, administrator-initiated failover for planned maintenance.
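
For the planned-maintenance case, the failover can also be triggered manually with the hdfs haadmin tool. A minimal sketch, assuming the NameNode IDs nn1 and nn2 that are configured later in this tutorial:

# Check which NameNode currently holds the Active role
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Gracefully hand the Active role from nn1 to nn2 before maintenance
hdfs haadmin -failover nn1 nn2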

3. HDFS HA solution: QJM

        QJM is short for Quorum Journal Manager. It is one of the HDFS HA solutions officially recommended by Hadoop.

        It uses ZKFC (backed by ZooKeeper) to perform the active-standby switchover, and a JournalNode (JN) cluster to share the edits log, which keeps the data of the two NameNodes synchronized.

3.1 QJM: active/standby switchover and split-brain prevention

3.1.1 ZKFailoverController (ZKFC)

        Apache ZooKeeper is a highly available distributed coordination service used to maintain a small amount of coordination data. The following ZooKeeper features take part in the HDFS HA solution:

  • Ephemeral znodes

        If a znode is ephemeral, its lifetime is tied to the session of the client that created it. When the client disconnects and the session ends, the znode is deleted automatically.

  • Path uniqueness

        ZooKeeper maintains a data structure similar to a directory tree, in which each node is called a znode. Znode paths are unique and duplicate names are not allowed, which also makes them usable as exclusive locks.

  • Watch mechanism

        A client can watch for events on a znode; when an event fires, the ZooKeeper service notifies the client.

        ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Every machine that runs a NameNode also runs a ZKFC, whose main responsibilities are:

  • Monitor and manage NameNode health

ZKFC monitors the health of the local NameNode process and machine by issuing periodic health-check commands.

  • Maintain a session with the ZooKeeper cluster

        If the local NameNode is healthy and the ZKFC sees that no other node currently holds the lock znode, it tries to acquire the lock itself. If it succeeds, it has "won the election" and runs a failover to make its local NameNode Active. If another node already holds the lock and the ZKFC loses the election, it registers a watch on the znode and waits for the next election.
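
Once the HA cluster built later in this tutorial is running, this election state can be inspected with ZooKeeper's zkCli.sh. The znode names below (ActiveStandbyElectorLock, ActiveBreadCrumb) are what a standard deployment creates under /hadoop-ha/<nameservice>; treat the listing as an illustrative sketch:

# Connect to any ZooKeeper node with the ZooKeeper CLI
zkCli.sh -server hadoop01:2181

# Inside the zkCli session: the ZKFC of the current Active NN holds an ephemeral lock znode
ls /hadoop-ha/mycluster
# expected output, roughly: [ActiveBreadCrumb, ActiveStandbyElectorLock]
get /hadoop-ha/mycluster/ActiveStandbyElectorLock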

3.1.2  Fencing (isolation) mechanism 

        Failover is the process of switching the active and standby roles. The greatest risk during the switchover is split-brain, so a fencing mechanism is used to isolate the previous Active node before the Standby is promoted to Active.

The Hadoop common library provides two fencing implementations, namely sshfence and shellfence (the default implementation).

  • sshfence: log in to the target node over SSH and use the fuser command to kill the process (the PID is located by the TCP port number, which is more accurate than the jps command); see the sketch after this list.

  • shellfence: run a user-defined shell command (script) to perform the isolation.
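
For reference, the isolation that sshfence performs is roughly equivalent to running the following on the old Active node (8020 is the NameNode RPC port configured later in this tutorial; this is a sketch of the idea, not the exact internal invocation):

# Kill whatever process is listening on the NameNode RPC port (TCP 8020)
fuser -v -k -n tcp 8020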

3.2 Solving active/standby data state synchronization

        The JournalNode (JN) cluster is a lightweight distributed system, mainly used for fast reading, writing and storage of data. Usually 2N+1 JournalNodes are used to store the shared edits log; the underlying protocol is similar to ZooKeeper's distributed consensus algorithm.

        Whenever a modification is made on the Active NN, the corresponding edits log record is written to at least a majority of the JNs. The Standby NN detects that the log on the JNs has changed, reads the new edits from the JNs, and replays the operations so that its own namespace (directory tree) stays in sync.

        When a failure occurs and the Active NN goes down, the Standby NN reads all remaining modification logs from the JNs before it becomes Active, which reliably guarantees that its namespace is consistent with that of the failed NN. It then seamlessly takes over and continues serving client requests, achieving high availability.
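
For intuition, once the cluster from section 4 is running, each JournalNode keeps these shared edits under its dfs.journalnode.edits.dir (configured below as /bigdata/hadoop/data/journaldata). The exact layout can vary by Hadoop version; roughly:

# On any JournalNode host: segments of the shared edits log for the nameservice
ls /bigdata/hadoop/data/journaldata/mycluster/current/
# edits_0000000000000000001-0000000000000000002  edits_inprogress_0000000000000000003  ...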

4. HDFS HA cluster setup

4.1  HA  cluster planning 

IP                Machine     Running roles
192.168.170.136   hadoop01    NameNode, ZKFC, DataNode, ZooKeeper, JournalNode
192.168.170.137   hadoop02    NameNode, ZKFC, DataNode, ZooKeeper, JournalNode
192.168.170.138   hadoop03    DataNode, ZooKeeper, JournalNode

4.2  Cluster basic environment preparation 

  1. Modify the Linux hostname:                          /etc/hostname

  2. Modify the IP address:                              /etc/sysconfig/network-scripts/ifcfg-ens33

  3. Map hostnames to IP addresses (see the example after this list):   /etc/hosts

  4. Turn off the firewall

  5. Set up passwordless SSH login

  6. Install the JDK and configure environment variables: /etc/profile

  7. Synchronize time across the cluster

  8. Configure passwordless SSH login between the active and standby NNs
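
For step 3, the hostname/IP mapping follows the planning table in section 4.1. A minimal sketch of what to append to /etc/hosts on every node (adjust the IPs to your own environment):

cat >> /etc/hosts << EOF
192.168.170.136 hadoop01
192.168.170.137 hadoop02
192.168.170.138 hadoop03
EOF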

Refer to this article for specific steps: Hadoop 3.2.4 cluster building detailed graphic tutorial_Stars.Sky's blog-CSDN blog 

Note: In what follows I modify the Hadoop cluster built in that article; only the changed parts are written out, and everything else stays the same as the original.

4.3 Modify the Hadoop configuration file 

4.3.1 hadoop-env.sh

[root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim hadoop-env.sh
# Configure JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.8.0_381
# Set the users that run the shell commands for each role
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root

4.3.2 core-site.xml

[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim core-site.xml 
<configuration>
<!-- Logical name of the HA cluster; must match the nameservice configured in hdfs-site.xml -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
</property>
<!-- Local data directory for Hadoop; created automatically at format time -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/bigdata/hadoop/data/tmp</value>
</property>
<!-- Username used when accessing HDFS through the Web UI -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
</property>
<!-- Addresses and ports of the ZooKeeper ensemble -->
<property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
</configuration>

4.3.3 hdfs-site.xml 

[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim hdfs-site.xml 
<configuration>
<!-- Set the HDFS nameservice to mycluster; must match core-site.xml -->
<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<!-- mycluster has two NameNodes: nn1 and nn2 -->
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 -->
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>hadoop01:8020</value>
</property>
<!-- HTTP address of nn1 -->
<property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>hadoop01:9870</value>
</property>
<!-- RPC address of nn2 -->
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>hadoop02:8020</value>
</property>
<!-- HTTP address of nn2 -->
<property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>hadoop02:9870</value>
</property>
<!-- Where the NameNode edits metadata is stored on the JournalNodes -->
<property> 
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/mycluster</value>
</property>
<!-- Local disk directory where the JournalNode stores its data -->
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/bigdata/hadoop/data/journaldata</value>
</property>
<!-- Enable automatic failover when a NameNode fails -->
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<!-- Which implementation class is responsible for failover when this cluster fails -->
<property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method -->
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<!-- sshfence requires passwordless SSH -->
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
</property>
<!-- Timeout for the sshfence mechanism (ms) -->
<property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>30000</value>
</property>
<!-- Enable short-circuit local reads -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<!-- The directory must be created manually: mkdir -p /var/lib/hadoop-hdfs -->
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<!-- Enable the exclude (blacklist) file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/excludes</value>
</property>
</configuration>
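
Two of the settings above reference paths that must exist before the cluster is started: the short-circuit-read socket directory and the excludes file. A small sketch, to be run on all three nodes:

# Domain socket directory for short-circuit local reads
mkdir -p /var/lib/hadoop-hdfs

# Initially empty blacklist (excludes) file
touch /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/excludes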

4.4 Synchronize the modified configuration files to the other nodes

[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hadoop-env.sh root@hadoop02:$PWD
hadoop-env.sh                                                                                                         100%   16KB   6.9MB/s   00:00    
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hadoop-env.sh root@hadoop03:$PWD
hadoop-env.sh                                                                                                         100%   16KB 991.1KB/s   00:00    
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r core-site.xml root@hadoop02:$PWD
core-site.xml                                                                                                         100% 1404   507.9KB/s   00:00    
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r core-site.xml root@hadoop03:$PWD
core-site.xml                                                                                                         100% 1404   386.9KB/s   00:00    
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hdfs-site.xml root@hadoop02:$PWD
hdfs-site.xml                                                                                                         100% 3256     1.1MB/s   00:00    
[root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp -r hdfs-site.xml root@hadoop03:$PWD
hdfs-site.xml                                                                                                         100% 3256     2.4MB/s   00:00

4.5 HA cluster initialization 

Install the zookeeper cluster: [Zookeeper Elementary] 02, Zookeeper Cluster Deployment_Stars.Sky's Blog-CSDN Blog 

#1. Start the ZooKeeper ensemble first
[root@hadoop01 /bigdata/hadoop/zookeeper]# zk.sh start

#2. Manually start the JournalNode cluster (on all 3 machines)
hdfs --daemon start journalnode

#3. On hadoop01, format the NameNode and start it
[root@hadoop01 ~]# hdfs namenode -format
[root@hadoop01 ~]# hdfs --daemon start namenode

#4. On hadoop02, synchronize the NameNode metadata
[root@hadoop02 ~]# hdfs namenode -bootstrapStandby

#5. Format ZKFC. Note: the machine this is run on becomes the first Active NN
[root@hadoop01 ~]# hdfs zkfc -formatZK
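
Optionally, you can confirm that hdfs zkfc -formatZK created the HA znode by checking ZooKeeper; on a standard deployment the parent znode is /hadoop-ha (a quick sketch):

[root@hadoop01 ~]# zkCli.sh -server hadoop01:2181
# inside the zkCli session:
ls /hadoop-ha
# expected output, roughly: [mycluster]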

4.6 HA cluster startup 

Start the HDFS cluster on hadoop01:

[root@hadoop01 ~]# start-dfs.sh 

[root@hadoop01 ~]# jps
6355 QuorumPeerMain
6516 JournalNode
7573 DataNode
7989 DFSZKFailoverController
8040 Jps
7132 NameNode

[root@hadoop02 ~]# jps
4688 JournalNode
5201 NameNode
5521 Jps
5282 DataNode
4536 QuorumPeerMain
5482 DFSZKFailoverController

[root@hadoop03 ~]# jps
4384 DataNode
3990 QuorumPeerMain
4136 JournalNode
4511 Jps

5. HDFS HA Cluster Demonstration

5.1 View the status of the two NameNodes on the web page

On hadoop01, the NameNode is shown as active:

On hadoop02, the NameNode is shown in the standby state:
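
The same information is available from the command line with hdfs haadmin, using the NameNode IDs nn1 and nn2 defined in hdfs-site.xml; on this cluster the output should look like:

[root@hadoop01 ~]# hdfs haadmin -getServiceState nn1
active
[root@hadoop01 ~]# hdfs haadmin -getServiceState nn2
standby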

5.2 Normal operation under HA cluster

[root@hadoop01 ~]# hadoop fs -mkdir /test02
[root@hadoop01 ~]# hadoop fs -put apache-zookeeper-3.7.1-bin.tar.gz /test02

The cluster can be operated normally through the Active NameNode, but files cannot be previewed on the Standby:

5.3 Simulate a failure

        On hadoop01, manually kill the NameNode process. The NameNode on hadoop02 then switches to the Active state, and the HDFS service remains available.

[root@hadoop01 ~]# jps
6355 QuorumPeerMain
6516 JournalNode
7573 DataNode
7989 DFSZKFailoverController
8267 Jps
7132 NameNode
[root@hadoop01 ~]# kill -9 7132
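
To confirm the takeover, query nn2's state and access the nameservice again (a quick sanity check reusing the file uploaded in 5.2):

[root@hadoop01 ~]# hdfs haadmin -getServiceState nn2
active
[root@hadoop01 ~]# hadoop fs -ls /test02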

5.3.1 HA automatic failover fails: error resolution

        Use kill -9 to simulate a JVM crash, or power-cycle the machine or unplug its network interface to simulate a different kind of failure. The other NameNode should automatically become Active within seconds. The time it takes to detect a failure and trigger a failover depends on ha.zookeeper.session-timeout.ms, whose default is 5 seconds.

If the test is unsuccessful, check  the logs of the zkfc  daemon as well as  the NameNode  daemon to further diagnose the problem. If the error message is as follows :

        The log reports that the fuser program cannot be found, so the fencing step cannot isolate the old Active node. Install it with the following command; fuser is part of the psmisc package and must be installed on both NameNode machines.

[root@hadoop01 ~]# yum install psmisc -y
[root@hadoop02 ~]# yum install psmisc -y

Finally, restart the HDFS cluster (see below) and simulate the failure again; this time the automatic failover succeeds.
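
A minimal sketch of that restart, run from hadoop01:

[root@hadoop01 ~]# stop-dfs.sh
[root@hadoop01 ~]# start-dfs.sh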

Previous article: HDFS cluster dynamic node management_Stars.Sky's Blog-CSDN Blog 

Next article: Hadoop YARN HA Cluster Installation and Deployment Detailed Graphical Tutorial_Stars.Sky's Blog-CSDN Blog
