Introduction to Hadoop High Availability Cluster

In a typical HA cluster, each NameNode runs on an independent server. At any given time, only one NameNode is active while the other is in standby. The active NameNode handles all client operations; the standby NameNode acts as a follower, keeping its state in sync so that it can take over at any time.
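
As a hedged sketch of how this active/standby pair is typically wired up, the Java snippet below sets the standard HDFS HA client properties programmatically on a Hadoop Configuration (in practice they live in hdfs-site.xml and core-site.xml); the nameservice ID `mycluster` and all host names are made-up placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One logical nameservice backed by two NameNodes (IDs nn1 and nn2).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Clients address the nameservice, not a concrete host; this proxy
        // provider tries both NameNodes and sticks to whichever is active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://mycluster");

        // All client operations are transparently routed to the active NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```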

For data synchronization, the two NameNodes communicate through a group of independent daemons called JournalNodes (JNs). When the namespace of the active NameNode is modified, the change is recorded on a majority of the JournalNodes. The standby NameNode reads these changes from the JNs, continuously watches the edit log, and applies the changes to its own namespace. In this way the standby guarantees that, in the event of a failover, its namespace state is fully synchronized.
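
A minimal sketch of the corresponding QJM settings, assuming a three-node JournalNode quorum with placeholder host names (normally these properties go in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class JournalNodeConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // The active NameNode writes each edit to this JournalNode quorum and
        // considers it durable once a majority of JNs have logged it; the
        // standby NameNode tails the same journal to stay in sync.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

        // Local directory where each JournalNode daemon stores its edit segments.
        conf.set("dfs.journalnode.edits.dir", "/data/hadoop/journal");

        System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
    }
}
```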

To ensure fast failover, the standby NameNode must also know the location of every data block in the cluster. To achieve this, all DataNodes are configured with the addresses of both NameNodes and send block location information and heartbeats to both of them.
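
The DataNode side needs no special HA logic beyond the configuration: because the nameservice lists two NameNode IDs, each DN registers with, heartbeats to, and sends block reports to both. A hedged sketch with placeholder host names (the interval values shown are the stock defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class DataNodeSideSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // The same nameservice definition is shipped to every DataNode; since
        // two NameNode IDs are listed, the DN reports to both of them, while
        // acting only on commands from the active NN (DataNode fencing).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Heartbeat every 3 seconds, full block report every 6 hours (defaults).
        conf.setLong("dfs.heartbeat.interval", 3);
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);
    }
}
```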

For HA clusters, it is critical that only one NameNode be active at a time. Otherwise, the namespace states of the two NameNodes would diverge, and data could be lost or erroneous results produced. To guarantee this, the JNs must ensure that only one NameNode can write to them at any given time.
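
As an illustration of how the JNs can enforce this single-writer rule, the toy class below (not Hadoop's actual code) sketches the epoch idea used by the Quorum Journal Manager: a would-be writer first claims a strictly higher epoch number from the journals, and each journal thereafter rejects edits carried by any lower epoch.

```java
// Toy sketch of epoch-based fencing on a single journal node.
public class ToyJournalNode {
    private long promisedEpoch = 0;   // highest epoch this journal has accepted

    // A would-be writer must first claim a new, strictly higher epoch.
    public synchronized boolean newEpoch(long writerEpoch) {
        if (writerEpoch <= promisedEpoch) {
            return false;              // a newer (or equal) writer already exists
        }
        promisedEpoch = writerEpoch;
        return true;
    }

    // Edits are accepted only from the writer holding the current epoch,
    // so a stale "active" NameNode is rejected automatically.
    public synchronized boolean journal(long writerEpoch, byte[] edits) {
        if (writerEpoch != promisedEpoch) {
            return false;              // write from a fenced writer
        }
        // ... append edits to the local edit log segment ...
        return true;
    }
}
```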

The JournalNode concept was newly added in Hadoop 2 (the MRv2/YARN generation). The role of the JournalNode is to store the edit log. In MRv1 (Hadoop 1), the edit log was stored alongside the fsimage and periodically merged by the SecondaryNameNode; an HA cluster no longer needs a SecondaryNameNode for this, because the standby NameNode takes over checkpointing. The following architecture diagram focuses on the role of the JournalNode.

[Figure: Hadoop 2 architecture diagram, highlighting the role of the JournalNode]

Use shared storage to synchronize the edit log (edits) between the two NNs.

In the past, HDFS was share-nothing except for the NN; now the NNs share storage, which is itself a single point of failure. However, mid-to-high-end storage devices come with RAID and redundant hardware, including power supplies and network cards, and are more reliable than ordinary servers, so overall reliability is still somewhat improved. Consistency is guaranteed by flushing after every metadata change in the NN, together with the close-to-open semantics of NFS. The community has also been experimenting with storing the metadata on BookKeeper to remove the dependency on shared storage, and Cloudera provides an implementation of the Quorum Journal Manager (QJM). A detailed analysis is available in the (Chinese) blog post "HDFS HA principle and code analysis based on QJM/Quorum Journal Manager/Paxos".
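
As a hedged sketch, the two flavours of shared storage described above differ only in the value of dfs.namenode.shared.edits.dir; the NFS mount point and JournalNode host names below are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class SharedEditsSketch {
    public static void main(String[] args) {
        // NFS-based HA: both NameNodes write edits to a shared filer mount,
        // relying on the storage hardware and NFS close-to-open semantics.
        Configuration nfsConf = new Configuration();
        nfsConf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/hadoop/shared-edits");

        // QJM-based HA: edits are replicated to a quorum of JournalNodes,
        // removing the dependency on a shared storage device.
        Configuration qjmConf = new Configuration();
        qjmConf.set("dfs.namenode.shared.edits.dir",
                    "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
    }
}
```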

The DataNode (hereinafter referred to as DN) reports block information to both NNs simultaneously.

This is a necessary step to keep the standby NN's view of the cluster's block locations up to date, so I won't go into details.

FailoverController process for monitoring and controlling NN processes

Obviously, we cannot handle heartbeats and health monitoring inside the NN process itself; the simplest reason is that a full GC can make the NN hang for more than ten minutes. Therefore, a small, independent watchdog process is dedicated to the monitoring. This is also a loosely coupled design that is easy to extend or replace. In the current version, ZooKeeper (hereinafter referred to as ZK) provides the synchronization lock, but users can easily replace this ZooKeeper FailoverController (hereinafter referred to as ZKFC) with another HA or leader-election scheme.
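
A hedged sketch of the settings that enable this automatic failover path, with placeholder ZooKeeper host names (dfs.ha.automatic-failover.enabled belongs in hdfs-site.xml, ha.zookeeper.quorum in core-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class ZkfcConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Turn on automatic failover: a ZKFC watchdog runs next to each
        // NameNode, monitors its health, and competes for a ZooKeeper lock.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);

        // ZooKeeper ensemble used for the leader-election lock.
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        // The ZKFC daemon itself is started on each NameNode host with:
        //   hdfs --daemon start zkfc   (or "hadoop-daemon.sh start zkfc" on older releases)
    }
}
```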

Isolation (fencing), which prevents split-brain, ensures that there is only one active NN at any time. It covers three aspects (see the configuration sketch after the list):

Shared storage fencing ensures that only one NN can write edits.

Client fencing ensures that only one NN can respond to the client's request.

DataNode fencing ensures that only one NN can issue commands to the DN, such as deleting blocks, copying blocks, etc.
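
A hedged configuration sketch for the fencing side of this, assuming the common sshfence-plus-shell-fallback setup (with QJM, the journal's epoch check already prevents a second writer, but a fencing method is still configured for failover); the private-key path is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Fencing methods are tried in order during a failover. "sshfence"
        // logs into the old active NameNode host and kills the NN process;
        // "shell(...)" runs an arbitrary script (here a no-op fallback).
        // Methods are separated by newlines in the configuration value.
        conf.set("dfs.ha.fencing.methods", "sshfence\nshell(/bin/true)");

        // Private key the ZKFC uses for the sshfence method.
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
    }
}
```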
