Hadoop HA mechanism and principle analysis

1. Architecture of Hadoop 1.x and 2.x

1. Architecture diagram

Hadoop 1.x architecture diagram
Hadoop 2.x architecture diagram
Hadoop 2.x solves the NameNode single point of failure of 1.x by introducing a dual-NameNode architecture and by using the shared storage system Quorum Journal Manager (QJM) to synchronize the metadata between the two NameNodes.

2. Hadoop 2.x metadata

The main function of Hadoop metadata is to maintain information about the files and directories in the HDFS file system. The metadata is kept in three main forms: the in-memory image, the disk image (FSImage), and the edit log (EditLog). When the NameNode starts, it loads the disk image into memory for metadata management; this copy lives in the NameNode's memory. The disk image is a snapshot of the HDFS metadata at a certain point in time, containing the namespace and the file-to-block mapping information for the DataNodes, and it is stored in the NameNode's local file system. The edit log records every operation initiated by clients, that is, every modification to the file system; it is periodically merged with the disk image into an up-to-date image to keep the NameNode metadata complete, and it is stored both on the NameNode's local disk and in the shared storage system (QJM).

The following listing shows the NameNode's EditLog and FSImage file layout. An EditLog file has two states, in-progress and finalized: in-progress means the segment is still being written, with a file name of the form edits_inprogress_[start-txid]; finalized means the segment has been closed, with a file name of the form edits_[start-txid]-[end-txid]. An FSImage file also has two states, finalized and checkpoint: finalized means the file has been persisted to disk, with a file name of the form fsimage_[end-txid]; checkpoint means the image is still being merged. In version 2.x the checkpoint process runs on the Standby NameNode (SNN): the SNN periodically merges its local FSImage with the EditLog of the Active NameNode (ANN) pulled back from QJM, and after the merge is finished it sends the new image back to the ANN over HTTP.
data/hbase/runtime/namespace
├── current
│   ├── VERSION
│   ├── edits_0000000003619794209-0000000003619813881
│   ├── edits_0000000003619813882-0000000003619831665
│   ├── edits_0000000003619831666-0000000003619852153
│   ├── edits_0000000003619852154-
│   ├── edits_0000000003619871028-0000000003619880765
│   ├── edits_0000000003619880766-0000000003620060869
│   ├── edits_inprogress_0000000003620060870
│   ├── fsimage_0000000003618370058
│   ├── fsimage_0000000003618370058.md5
│   ├── fsimage_0000000003620060869
│   ├── fsimage_0000000003620060869.md5
│   └── seen_txid
└── in_use.lock
Another important file shown above is seen_txid, which stores a transaction ID: the latest end transaction ID of the EditLog. When the NameNode restarts, it walks the edit segments in order, starting from edits_0000000000000000001 up to the segment containing the txid recorded in seen_txid, and replays them to recover the metadata. If this file is lost or the recorded transaction ID is wrong, block information will be lost.
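
The check described above can be illustrated with a small sketch. This is not the real NameNode recovery code; the storage path is the example directory shown earlier and the logic is only an assumption of how coverage against seen_txid could be verified.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Minimal sketch: check whether the finalized edit segments on disk
 *  reach the transaction ID recorded in seen_txid. */
public class SeenTxidCheck {
    public static void main(String[] args) throws IOException {
        Path current = Paths.get("data/hbase/runtime/namespace/current"); // illustrative path
        long seenTxid = Long.parseLong(Files.readString(current.resolve("seen_txid")).trim());

        long lastFinalizedTxid = 0;
        try (Stream<Path> files = Files.list(current)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                // finalized segments are named edits_<start-txid>-<end-txid>
                if (name.startsWith("edits_") && !name.contains("inprogress")) {
                    String end = name.substring(name.lastIndexOf('-') + 1);
                    if (!end.isEmpty()) {
                        lastFinalizedTxid = Math.max(lastFinalizedTxid, Long.parseLong(end));
                    }
                }
            }
        }

        // If the segments stop short of seen_txid, edits (and hence block
        // information) would be missing after a restart.
        if (lastFinalizedTxid < seenTxid) {
            System.err.println("Missing edits: last finalized txid " + lastFinalizedTxid
                    + " < seen_txid " + seenTxid);
        } else {
            System.out.println("Edit segments cover seen_txid=" + seenTxid);
        }
    }
}
```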

The essence of HA is to keep the metadata of the active and standby NameNodes consistent, that is, to ensure that the FSImage and EditLog on the standby NN are also complete. Metadata synchronization depends largely on EditLog synchronization, and the key to this step is the shared storage. Let's start with the QJM shared storage mechanism.

2. QJM principle

1. Introduction to QJM

The full name of QJM is Quorum Journal Manager. It is made up of JournalNodes (JN), generally an odd number of them. Each JournalNode exposes a simple RPC interface that allows the NameNode to read and write the EditLog on the JN's local disk. When writing an EditLog entry, the NameNode writes to all JournalNodes in parallel at the same time; as long as N/2 + 1 nodes acknowledge the write, the operation is considered successful, following the quorum principle of the Paxos protocol. The internal implementation framework is as follows:
JN architecture diagram
As the figure shows, the write path involves several management and output-stream objects related to the EditLog, each playing a different role (a simplified sketch of this layering follows the list):

  • FSEditLog: the entry point for all EditLog operations
  • JournalSet: integrates the EditLog operations on the local disk and on the JournalNode cluster
  • FileJournalManager: implements EditLog operations on the local disk
  • QuorumJournalManager: implements EditLog operations on the JournalNode cluster
  • AsyncLoggerSet: implements the collection of EditLog write operations on the JournalNode cluster
  • AsyncLogger: initiates RPC requests to the JNs and performs the actual log synchronization
  • JournalNodeRpcServer: the RPC service running in the JournalNode process, receiving RPC requests from the AsyncLogger on the NameNode side
  • JournalNodeHttpServer: the HTTP service running in the JournalNode process, used to serve EditLog file streams to the Standby NameNode and to other JournalNodes during synchronization
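
The following sketch illustrates how these objects are layered. The class names match the ones listed above, but the signatures and bodies are simplified illustrations rather than the exact Hadoop interfaces.

```java
/** Sketch of the layering described above (signatures are simplified, not the exact Hadoop APIs). */
interface JournalManager {
    void startLogSegment(long txid);
    void journal(long firstTxid, int numTxns, byte[] records);
    void finalizeLogSegment(long firstTxid, long lastTxid);
}

/** Writes edits to a directory on the local disk (dfs.namenode.name.dir). */
class FileJournalManager implements JournalManager {
    public void startLogSegment(long txid) { /* open edits_inprogress_<txid> */ }
    public void journal(long firstTxid, int numTxns, byte[] records) { /* append and flush */ }
    public void finalizeLogSegment(long first, long last) { /* rename to edits_<first>-<last> */ }
}

/** Sends the same calls as RPCs to the JournalNode cluster via AsyncLogger/AsyncLoggerSet. */
class QuorumJournalManager implements JournalManager {
    public void startLogSegment(long txid) { /* broadcast to all JNs, wait for a majority */ }
    public void journal(long firstTxid, int numTxns, byte[] records) { /* quorum write */ }
    public void finalizeLogSegment(long first, long last) { /* quorum finalize */ }
}

/** FSEditLog fans every operation out to all registered managers through a JournalSet. */
class JournalSet {
    private final java.util.List<JournalManager> journals = new java.util.ArrayList<>();
    void add(JournalManager jm) { journals.add(jm); }
    void journal(long firstTxid, int numTxns, byte[] records) {
        for (JournalManager jm : journals) {
            jm.journal(firstTxid, numTxns, records);
        }
    }
}
```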

2. Analysis of the QJM write process

As mentioned above, the NameNode writes the EditLog both locally and to the JournalNodes. The local directories are controlled by the configuration parameter dfs.namenode.name.dir, and the JN write is controlled by dfs.namenode.shared.edits.dir. Two different output streams control the log-writing process: EditLogFileOutputStream (local output stream) and QuorumOutputStream (JN output stream). Writing an EditLog entry does not go straight to disk. To ensure high throughput, the NameNode defines two equal-sized buffers, about 512 KB each, for EditLogFileOutputStream and QuorumOutputStream: a write buffer (bufCurrent) and a sync buffer (bufReady). This allows writing and syncing to proceed at the same time, so writing the EditLog is an asynchronous process, and syncing is batched so that the log does not have to be flushed on every single write. This double-buffer, asynchronous write mechanism effectively turns HDFS metadata writes from disk writes into memory writes. If the cluster is large (100+ nodes), the metadata buffer may become a bottleneck; you can modify the source code to raise the 512 KB size to around 5 MB, and, going further, follow the pattern of other Hadoop configuration parameters to extract this value into the configuration file. At present it is not configurable.

How is syncing done while writing? There is a buffer-swap step in the middle: bufCurrent and bufReady are exchanged when the trigger condition is met. For example, when bufCurrent reaches its threshold and the data in bufReady has already been synced, bufReady is cleared, the bufCurrent pointer is switched to point at bufReady so writing can continue, and the bufReady pointer is switched to point at bufCurrent so that its contents can be synced to the EditLog. The process above is shown in the following flowchart:
(Flowchart: double-buffer swap between bufCurrent and bufReady)
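
A minimal sketch of the double-buffer idea follows. It is modeled loosely on the mechanism described above, not on the exact Hadoop implementation; the buffer size and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;

/**
 * Sketch of the double-buffer write path: writers append to bufCurrent,
 * a sync step swaps the buffers and flushes bufReady, so writing and
 * syncing can overlap.
 */
public class DoubleBufferSketch {
    private static final int BUFFER_LIMIT = 512 * 1024; // ~512 KB per buffer

    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

    /** Called on the write path: cheap in-memory append only. */
    public synchronized void writeOp(byte[] editRecord) {
        if (bufCurrent.size() + editRecord.length > BUFFER_LIMIT) {
            throw new IllegalStateException("write buffer full, sync is lagging behind");
        }
        bufCurrent.write(editRecord, 0, editRecord.length);
    }

    /** Swap the buffers: new writes go to the (already flushed) old bufReady. */
    public synchronized void setReadyToFlush() {
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    /** Flush bufReady to the real destination (local file or JournalNodes). */
    public void flush() {
        byte[] batch = bufReady.toByteArray();
        // ... write 'batch' to disk / send it to the JNs, then clear the buffer ...
        bufReady.reset();
        System.out.println("flushed " + batch.length + " bytes in one batch");
    }
}
```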

Question one:

Since the EditLog is written asynchronously, how is the data in the buffer protected from loss? Although the write itself is asynchronous, every log entry must be successfully synced by logSync before a success code is returned to the client. If the NameNode becomes unavailable at some point, the data still sitting in its memory has simply not been synced yet, so the client never received a success response and will not consider that data written.
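
The following sketch illustrates this contract. It is not the real FSNamesystem code; logEdit/logSync and the mkdir handler are stand-ins that only show the ordering: the client is acknowledged only after the sync completes.

```java
/** Sketch: an edit lost in NameNode memory is an edit the client never saw succeed. */
public class EditAckSketch {
    private final StringBuilder bufCurrent = new StringBuilder();

    private synchronized void logEdit(String op) {
        bufCurrent.append(op).append('\n');   // in memory only, not yet durable
    }

    private synchronized void logSync() {
        // Stand-in for flushing to local disk and to a quorum of JournalNodes.
        System.out.println("durably flushed: " + bufCurrent);
        bufCurrent.setLength(0);
    }

    /** Example RPC handler: acknowledge the client only after the edit is durable. */
    public boolean mkdir(String path) {
        logEdit("MKDIR " + path);
        logSync();      // blocks until the edit batch is flushed
        return true;    // success reaches the client only after the sync
    }
}
```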

Question two:

How can EditLog be consistent across multiple JNs?

1. Isolation against dual writers

Whenever the ANN synchronizes the EditLog to the JNs, it must be guaranteed that no two NNs sync logs to the JNs at the same time. How is this isolation achieved? This involves an important concept used in many distributed systems, the Epoch Number, which has the following characteristics:

  • When an NN becomes the active node, it is assigned an EpochNumber
  • Each EpochNumber is unique; the same EpochNumber never appears twice
  • EpochNumbers are strictly ordered: after every NN switch the EpochNumber is incremented, so a later EpochNumber is always greater than an earlier one
    How does QJM guarantee these properties? The main points of using the EpochNumber are:
  • First, before making any change to the EditLog, the QuorumJournalManager (on the NameNode) must be assigned an EpochNumber
  • Second, QJM sends its EpochNumber to all JN nodes via newEpoch(N)
  • Third, when a JN receives the newEpoch request, it saves the QJM's EpochNumber in a lastPromisedEpoch variable and persists it to local disk
  • Fourth, any RPC request by which the ANN synchronizes logs to a JN (such as logEdits(), startLogSegment(), etc.) must carry the ANN's EpochNumber
  • Fifth, after receiving such an RPC request, the JN compares it with lastPromisedEpoch; if the requested EpochNumber is smaller than lastPromisedEpoch, the request is rejected, otherwise it is accepted and the requested EpochNumber is saved into lastPromisedEpoch

This ensures that during an active/standby switch, even if both NNs try to sync logs to the JNs at the same time, the logs cannot be corrupted: after the switch, the old ANN's EpochNumber is necessarily smaller than the new ANN's, so every sync request the old ANN sends to the JNs is rejected. This is the fencing that prevents split-brain.
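
A minimal sketch of the JN-side fencing rule follows. It is illustrative rather than the real JournalNode code: every request carries the caller's epoch and is checked against the persisted lastPromisedEpoch exactly as described in the five steps above.

```java
/** Sketch of epoch-based fencing on a JournalNode. */
public class EpochFencingSketch {
    private long lastPromisedEpoch = 0; // persisted to local disk in the real JN

    /** Handles newEpoch(N): promise never to accept a writer with a smaller epoch. */
    public synchronized void newEpoch(long proposedEpoch) {
        if (proposedEpoch <= lastPromisedEpoch) {
            throw new IllegalStateException("epoch " + proposedEpoch
                    + " <= lastPromisedEpoch " + lastPromisedEpoch);
        }
        lastPromisedEpoch = proposedEpoch; // the real JN persists this before replying
    }

    /** Called at the start of every log-mutating RPC (logEdits, startLogSegment, ...). */
    public synchronized void checkRequest(long requestEpoch) {
        if (requestEpoch < lastPromisedEpoch) {
            // A stale writer (the old ANN) is fenced off here.
            throw new IllegalStateException("rejected: epoch " + requestEpoch
                    + " < lastPromisedEpoch " + lastPromisedEpoch);
        }
        lastPromisedEpoch = requestEpoch;
    }
}
```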

2. Log synchronization

This step is the process of synchronizing logs from the ANN to the JNs, as follows (a sketch of the quorum acknowledgement in steps 2-4 is shown after the list):

  • 1. Execute the logSync process and put the log data on the ANN into a cache queue
  • 2. Send the cached data to the JNs; each JN has a thread that handles the logEdits request
  • 3. After receiving the data, the JN first checks that the EpochNumber is legal, then verifies that the log transaction ID is as expected, flushes the log to disk, and returns a success code to the ANN
  • 4. Once the ANN has received success responses from a quorum of JNs, it returns a write-success flag to the client; if this fails, it throws an exception
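
The quorum acknowledgement can be sketched as follows. This is an illustration of the N/2 + 1 rule, not Hadoop's QuorumCall implementation; sendLogEdits, the timeout, and the thread pool are assumptions.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: a batch of edits is sent to all JNs in parallel and the write
 *  succeeds once N/2 + 1 of them acknowledge it. */
public class QuorumWriteSketch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public boolean writeBatch(List<String> journalNodes, byte[] batch) throws InterruptedException {
        int majority = journalNodes.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);

        for (String jn : journalNodes) {
            pool.submit(() -> {
                if (sendLogEdits(jn, batch)) {   // RPC to one JournalNode
                    acks.countDown();            // count successful acknowledgements
                }
            });
        }
        // Succeed as soon as a majority has flushed the batch; on timeout the
        // write to the shared directory is treated as failed.
        return acks.await(20, TimeUnit.SECONDS);
    }

    private boolean sendLogEdits(String journalNode, byte[] batch) {
        // placeholder for the real logEdits RPC; assume it flushes to the JN's disk
        return true;
    }
}
```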

3. Recovering the in-progress log

Why is this step needed? If a write fails partway through, the length of the EditLog on each JN may differ, and the inconsistent parts must be recovered before writing can resume. The recovery mechanism is as follows (in short: first agree on a new EpochNumber, then recover the log based on the largest EditLogSegment reported by the JNs, and finally change the recovered segment's state to finalized; a small sketch of the selection logic follows the list):

  • 1. The ANN first sends a getJournalState request to all JNs;
  • 2. Each JN returns an Epoch (lastPromisedEpoch) to the ANN;
  • 3. After receiving Epochs from a majority of JNs, the ANN picks the largest one, adds 1 to it as the current new Epoch, and sends a newEpoch request to the JNs to deliver the new Epoch;
  • 4. When a JN receives the new Epoch, it compares it with lastPromisedEpoch: if it is larger, the JN updates its local value and returns the start transaction ID of its latest local EditLogSegment; if it is smaller, it returns an error to the NN;
  • 5. After receiving success responses from a majority of JNs, the ANN considers the new Epoch generated successfully and begins preparing for log recovery;
  • 6. The ANN selects the EditLogSegment with the largest transaction ID as the basis for recovery and then sends a prepareRecovery RPC request to the JNs; this corresponds to Phase 1a (Prepare) of the two-phase Paxos protocol. If a majority of JNs respond to prepareRecovery successfully, Phase 1 is considered successful;
  • 7. The ANN chooses the data source for synchronization and sends an acceptRecovery RPC request to the JNs, passing the data source as a parameter;
  • 8. When a JN receives the acceptRecovery request, it downloads the EditLogSegment from the source's JournalNodeHttpServer and replaces its locally saved EditLogSegment; this corresponds to the accept phase (Phase 2a) of the Paxos protocol. On completion it returns success to the ANN;
  • 9. After the ANN receives success responses from a majority of JNs, it sends a finalizeLogSegment request to the JNs, indicating that data recovery is complete, so that the logs on all JNs are consistent again.
    Once the data is recovered, the in-progress segment is renamed to a finalized segment of the form edits_[start-txid]-[end-txid].
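
The selection logic in steps 3, 6 and 7 can be sketched as follows. This is an illustration, not the real QuorumJournalManager code; the record type, JN names and txid values are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: pick the new epoch from the majority's responses, then choose the JN
 *  whose unfinalized segment reaches the highest transaction ID as the source. */
public class RecoverySelectionSketch {

    /** What one JN reports back: its promised epoch and its last segment's txid range. */
    record JnState(String jn, long lastPromisedEpoch, long segmentStartTxid, long segmentEndTxid) {}

    public static void main(String[] args) {
        List<JnState> responses = new ArrayList<>(List.of(
                new JnState("jn1", 41, 3_620_060_870L, 3_620_061_000L),
                new JnState("jn2", 41, 3_620_060_870L, 3_620_060_950L),
                new JnState("jn3", 40, 3_620_060_870L, 3_620_061_000L)));

        // Step 3: new epoch = max(lastPromisedEpoch seen from a majority) + 1.
        long newEpoch = responses.stream()
                .mapToLong(JnState::lastPromisedEpoch).max().orElse(0) + 1;

        // Steps 6/7: the recovery source is the JN whose segment extends furthest.
        JnState source = responses.stream()
                .max(Comparator.comparingLong(JnState::segmentEndTxid)).orElseThrow();

        System.out.println("newEpoch = " + newEpoch);
        System.out.println("recover segment [" + source.segmentStartTxid() + ", "
                + source.segmentEndTxid() + "] from " + source.jn());
        // The other JNs then fetch this segment over HTTP (acceptRecovery) and
        // finally all segments are finalized (finalizeLogSegment).
    }
}
```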

3. Analysis of the QJM read process

The read process serves the Standby NameNode (SNN), which periodically checks the JournalNodes for new EditLog segments and pulls them back locally. A thread on the SNN, StandbyCheckpointer, periodically merges the SNN's FSImage and EditLog and sends the merged FSImage file back to the active NN (ANN); this is called the checkpointing process. Let's look at how checkpointing works.

In version 2.x, the checkpointing formerly driven by the SecondaryNameNode has been replaced by checkpointing driven by the SNN. The following is the checkpoint flow chart:
(Flowchart: SNN checkpoint process)
In general, the SNN first checks the preconditions, which cover two aspects: the time elapsed since the last checkpoint and the number of transactions accumulated in the EditLog. If either precondition is met, a checkpoint is triggered. The SNN then saves the latest namespace data, that is, the metadata currently in the SNN's memory, to a temporary fsimage file (fsimage.ckpt); based on the latest transaction ID of the EditLog pulled from the JNs, it merges in all the metadata modification records not yet reflected in fsimage.ckpt, renames the result to a new fsimage file, and generates an md5 file at the same time. The latest fsimage is then sent back to the ANN via an HTTP request. What are the benefits of periodically merging the fsimage? Mainly the following (a sketch of the trigger check follows this list):

  • It prevents the EditLog from growing without bound; old EditLog segments can be deleted after being merged into a new fsimage
  • It avoids putting extra load on the active NN (ANN), since the merge is performed on the SNN
  • It guarantees that the fsimage holds a recent copy of the metadata, avoiding data loss during failure recovery
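
The precondition check can be sketched as follows. The two thresholds map to the standard HDFS parameters dfs.namenode.checkpoint.period (seconds) and dfs.namenode.checkpoint.txns; the default values and the example counters below are assumptions, and this is not the real StandbyCheckpointer code.

```java
import org.apache.hadoop.conf.Configuration;

/** Sketch of the checkpoint trigger check on the SNN. */
public class CheckpointTriggerSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600);    // time-based threshold
        long txnLimit   = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000); // txn-count threshold

        long secsSinceLastCheckpoint = 4200;   // example values an SNN might observe
        long uncheckpointedTxns      = 250_000;

        // Either condition being met triggers a checkpoint on the SNN.
        boolean doCheckpoint = secsSinceLastCheckpoint >= periodSecs
                || uncheckpointedTxns >= txnLimit;
        System.out.println("trigger checkpoint: " + doCheckpoint);
    }
}
```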

3. Active and standby switching mechanism

To achieve HA, in addition to metadata synchronization there must be a complete active/standby switching mechanism. Hadoop's active/standby election relies on ZooKeeper. The following is the state diagram of the switch:
(State diagram: active/standby switch driven by ZKFC)
As the figure shows, the entire switch process is controlled by the ZKFC, which consists of three components: HealthMonitor, ZKFailoverController, and ActiveStandbyElector (a minimal configuration sketch for enabling this follows the list).

  • ZKFailoverController: the parent of HealthMonitor and ActiveStandbyElector; it performs the actual switching operations
  • HealthMonitor: monitors the health of the NameNode; if the state is abnormal it calls back into ZKFailoverController to trigger an automatic active/standby switch
  • ActiveStandbyElector: asks ZooKeeper to run the active/standby election; when ZooKeeper completes the change, it calls back the corresponding ZKFailoverController method to switch the active/standby state
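
For reference, the sketch below shows the configuration keys that wire QJM-based HA and ZKFC-driven automatic failover together. The keys are standard HDFS HA parameters; the host names, the nameservice id "mycluster" and the NN ids are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;

/** Minimal configuration sketch for QJM-based HA with automatic failover. */
public class HaConfigSketch {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Shared EditLog storage: the JournalNode quorum used by QJM.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
        // Let ZKFC + ZooKeeper drive the active/standby election automatically.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```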

What roles does ZooKeeper play during failover? Mainly the following:

  • Failure detection: every NameNode in the cluster maintains a session in ZooKeeper; once the machine goes down, the session expires and failover is triggered
  • Active NameNode election: ZooKeeper provides a mechanism to elect the active NN; once the current ANN goes down, another NameNode can apply to ZooKeeper to become the next active node
  • Split-brain prevention: ZooKeeper itself is strongly consistent and highly available, so it can be used to guarantee that there is only one active node at any time

In which scenarios will an automatic switch be triggered? The following scenarios are summarized from HDFS-2185:

  • ActiveNN JVM crash: the HealthMonitor on the ANN gets a connection timeout when checking status and transitions to SERVICE_NOT_RESPONDING; the ZKFC on the ANN then quits the election, the ZKFC on the SNN obtains the active lock, and the SNN becomes the active node after the corresponding fencing.
  • ActiveNN JVM freeze: the JVM has not crashed but can no longer respond; as with a crash, this triggers an automatic switch.
  • ActiveNN machine down: the ActiveStandbyElector loses its heartbeat with ZooKeeper and the session times out; the ZKFC on the SNN asks ZooKeeper to delete the ANN's active lock and completes the active/standby switch after the corresponding fencing.
  • ActiveNN health abnormal: the HealthMonitor receives a HealthCheckFailedException and triggers an automatic switch.
  • Active ZKFC crash: although ZKFC is an independent process, its simple design means it can also fail. If the ZKFC process dies while the NameNode itself is fine, the system still considers a switch necessary: the SNN sends a request to the ANN asking it to give up the active role, and after the ANN receives the request the automatic switch is completed.
  • ZooKeeper crash: if ZooKeeper itself goes down, the ZKFCs on both the active and standby NNs notice the disconnection and enter a NeutralMode, in which they keep working without changing the active/standby state. If the ANN then also fails, the cluster cannot fail over and becomes unavailable. For this reason ZooKeeper should not be allowed to lose too many nodes: at least N/2 + 1 ZooKeeper nodes must remain in service for the ensemble to be considered safe.

4. Summary

The Hadoop HA mechanism introduced above can be summarized in two parts: metadata synchronization and active/standby election. Metadata synchronization relies on QJM shared storage, while the active/standby election relies on ZKFC and ZooKeeper.
