HDFS High Availability

NameNode High Availability

Background

Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF): (1) the machine hosting the NameNode could crash; (2) the machine hosting the NameNode could need planned hardware or software maintenance.

To bring the service back up, a new NameNode must (1) load the namespace image into memory, (2) replay the edit log, and (3) receive enough block reports from the DataNodes to leave safe mode. In a large cluster with many files and blocks, a cold start of the NameNode can take 30 minutes or more.

Architecture

NameNode HA involves two NameNodes. At any time, exactly one NameNode is in the Active state and the other is in the Standby state. The architecture also includes the ZooKeeper Failover Controller (ZKFC), ZooKeeper, and a shared edit log. The Active NameNode serves all client operations on the cluster, while the Standby NameNode acts as a hot backup, maintaining enough state to provide fast failover.
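To make the client side of this concrete, the sketch below (in Java, using the standard Hadoop client API) shows the minimum configuration a client needs in order to address the NameNode pair through a logical nameservice instead of a single host. The nameservice name mycluster, the NameNode IDs nn1/nn2, and the host names are hypothetical; in a real cluster these values come from hdfs-site.xml.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice; clients address the NameNode pair through this name.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");

        // RPC addresses of the two NameNodes (hypothetical hosts).
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Proxy provider that retries against the other NameNode when a failover happens.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client talks to the logical nameservice, never to a fixed NameNode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println(fs.exists(new Path("/")));
    }
}

In practice these properties are set once in hdfs-site.xml rather than in code; the point is that a client is never tied to the address of a single NameNode.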

The process of implementing HA

(1) After the cluster starts, one NameNode is in the Active state and provides service: it handles requests from clients and DataNodes, records changes in its edit log, and writes the edit log both locally and to the shared edit log (NFS, QJM, etc.).

[There are two kinds of shared edit log:

1. NFS: the Active and Standby NameNodes both access a directory on an NFS filer or shared storage device. The Active NameNode records its namespace modifications in the edit log and stores the edit log in the shared directory; the Standby NameNode reads the edit log from the shared directory and applies it to its own namespace.

2. QJM: the Active and Standby NameNodes communicate with a separate group of daemons called JournalNodes (JNs). The Active NameNode writes each modification to a majority of the JNs (at least ⌊N/2⌋ + 1, where N is the number of JournalNodes), and the Standby NameNode reads the modifications from the JNs and applies them to its own namespace.]

(2) The other NameNode is in the Standby state. It loads the namespace image file when it starts and then continuously applies the shared edit log to its own namespace, thereby staying synchronized with the state of the Active NameNode. During a failover, the Standby node must make sure it has read all of the edits from the shared edit log before it becomes Active; this guarantees that the namespace state is fully synchronized.

(3) So that the Standby NameNode can start serving quickly after the Active NameNode fails, every DataNode sends its block location information and heartbeats [block reports] to both NameNodes, because the most time-consuming part of a NameNode start-up is processing the block reports from all DataNodes. To turn the Standby into a true hot backup, ZKFC and ZooKeeper are added: the active node is elected through ZooKeeper (see the election sketch after step (4)), and the Failover Controller switches the NameNode to Active or Standby via RPC.

(4) When the Active NameNode fails, the Standby NameNode can take over quickly because its memory already holds the latest state: (1) the latest edit log and (2) the latest block mapping.
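The ZooKeeper election mentioned in step (3) boils down to an ephemeral lock znode: whichever ZKFC creates the znode first makes its NameNode Active, and the znode disappears automatically if that ZKFC's session dies, letting the other side take over. The sketch below only illustrates the idea; it is not the actual ZKFC/ActiveStandbyElector code, and the lock path, connection string, and NameNode ID are made up.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActiveLockSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper quorum; 5000 ms session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> { });

        try {
            // EPHEMERAL: the znode vanishes when this session dies, which is what
            // allows the standby side to notice the failure and take over.
            zk.create("/active-lock-demo",
                      "nn1".getBytes("UTF-8"),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Won the election: transition the local NameNode to Active");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost the election: stay Standby and watch the lock znode");
        } finally {
            zk.close();
        }
    }
}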

Two options for highly available shared storage: NFS and QJM

The quorum journal manager (QJM) is a dedicated HDFS implementation whose sole purpose is to provide a highly available edit log, and it is the recommended choice for most HDFS installations.

How QJM works: QJM runs a group of JournalNodes, and each edit must be written to a majority of them. Typically there are three JournalNodes (at least three are required), so each edit must be written to at least two of them, which allows one JournalNode to fail. [This is similar to ZooKeeper, but QJM is not built on ZooKeeper.]
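To illustrate just the majority rule (this is not the real QuorumJournalManager, whose classes are internal to HDFS; the JournalNode interface here is a stand-in), the sketch below sends an edit to every journal node and treats the write as successful only once more than half of them have acknowledged it.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class QuorumWriteSketch {

    // Stand-in for the RPC stub of a single journal node.
    interface JournalNode {
        CompletableFuture<Boolean> writeEdit(long txId, byte[] edit);
    }

    // Send the edit to all journal nodes and wait until a majority acknowledges it;
    // with 3 nodes, 2 acks are enough, so one node may be slow or down.
    static boolean logEdit(List<JournalNode> journals, long txId, byte[] edit) {
        int needed = journals.size() / 2 + 1;           // majority: floor(N/2) + 1
        int acks = 0;

        List<CompletableFuture<Boolean>> pending = journals.stream()
                .map(jn -> jn.writeEdit(txId, edit))    // fire the write to every node
                .collect(Collectors.toList());

        for (CompletableFuture<Boolean> f : pending) {
            try {
                if (f.get(5, TimeUnit.SECONDS)) {       // bounded wait per node
                    acks++;
                }
            } catch (Exception e) {
                // A slow or failed journal node is tolerated as long as a majority succeeds.
            }
            if (acks >= needed) {
                return true;                            // quorum reached, edit is durable
            }
        }
        return false;                                   // fewer than a majority acknowledged
    }
}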

Using fencing to prevent "split-brain"

Why Fencing?

A slow network or a network partition can trigger a failover even though the previous Active NameNode is still running and still believes it is Active. HA must ensure that such a NameNode is prevented from continuing to act as the Active node.

Two kinds of isolation

(1) Shared storage isolation: ensure that only one of the two NameNodes can write to the shared edit log at any time.

(2) DataNode and client isolation: ensure that only one NameNode can command the DataNodes and respond to client requests.

For an HA cluster, there can only be one Active NameNode at a time; otherwise the namespace state would quickly diverge into two versions, leading to data loss or other incorrect results. This is the "split-brain" scenario. To prevent it, a fencing method must be configured for the shared storage. During a failover, if it cannot be confirmed that the previous Active node has relinquished its Active state, the fencing process cuts off that node's access to the shared edit storage, which prevents it from making any further edits to the namespace and allows the new Active node to take over safely. (In other words, once the primary NameNode fails, the shared storage is isolated immediately so that only one NameNode can command the DataNodes. The clients are then fenced as well, so that only one NameNode can respond to client requests: clients still talking to the old node fail fast and, after a few retries, connect to the new NameNode. The cost to the client is some extra retry time, which is essentially invisible to the application.)

Why is QJM recommended?

QJM only allows one NameNode to write the edit log at a time; however, the previous Active NameNode may still be serving stale read requests from clients, so an SSH fencing command is usually configured to kill that NameNode process.
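This SSH fencing method is configured through the dfs.ha.fencing.* properties. A minimal sketch follows; the private-key path is hypothetical, and in a real cluster these settings live in hdfs-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;

public class SshFencingConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // sshfence logs into the host of the previous Active NameNode over SSH
        // and kills the NameNode process so it can no longer serve clients.
        conf.set("dfs.ha.fencing.methods", "sshfence");

        // Private key the ZKFC uses to SSH into the other NameNode host (hypothetical path).
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");

        System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
}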

Because NFS cannot enforce that only one NameNode writes to it at a time, NFS requires stronger fencing methods, including: 1. revoking the NameNode's access to the shared storage directory (typically by using a vendor-specific NFS command); 2. disabling its network port via a remote management command; 3. STONITH, or "shoot the other node in the head", which uses a specialized power distribution unit to forcibly power down the host machine.

Failover Controller

The transition between the Active and Standby NameNodes is handled by the Failover Controller. Hadoop's default Failover Controller implementation is based on ZooKeeper, which guarantees that there is only one Active NameNode. Each Failover Controller monitors the health of its NameNode, operating system, and hardware, and triggers a failover if the NameNode fails.
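As a hedged example of how this ZooKeeper-based controller is switched on, the sketch below sets the two properties that enable automatic failover and point the ZKFCs at a ZooKeeper ensemble. The host names are hypothetical, and in practice these values go into hdfs-site.xml and core-site.xml rather than being set in code.

import org.apache.hadoop.conf.Configuration;

public class AutoFailoverConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Run a ZKFC next to each NameNode and let ZooKeeper elect the Active one.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);

        // ZooKeeper ensemble used by the ZKFCs for election (hypothetical hosts).
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        System.out.println(conf.getBoolean("dfs.ha.automatic-failover.enabled", false));
    }
}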

 

NOTE: The Standby NameNode in an HA cluster also checkpoints the namespace state, so it is an error to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster.

References:

(1) "Hadoop: The Definitive Guide"

(2)http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
