Hadoop NameNode high availability analysis: QJM core source code interpretation


Background introduction

The HDFS NameNode records an operation log for every write it accepts. Originally this log was written only to local disk: after a restart or failure, the NameNode could be restored to its pre-crash state from the local image file plus the operation log, with no data inconsistency. But with the log on a single machine, a disk failure means the state cannot be recovered after a restart and the data becomes inconsistent: a file that was just created no longer exists, a deletion that was acknowledged as successful reappears, and other oddities. This is something a distributed storage system cannot tolerate, and it is also what makes high availability (HA) hard to build on a single copy of the log.

On a stand-alone system, a WAL (write-ahead log) is used to make recovery possible. In HDFS the corresponding mechanism is the operation log (EditLog), which records a description of every operation. Let's briefly introduce the editlog format.

File format

  • In-progress log: edits_inprogress_txid, which is the "segment" mentioned later; txid is the first transaction ID in the file
  • Finalized log: edits_firstTxid_endTxid, a log file whose content is consistent and no longer changes

Content format

File header: a version number and a transaction header identifier

File body: each transaction record contains

1. Operation type - 1 byte
2. Record length - 4 bytes
3. Transaction ID (txid) - 8 bytes
4. Operation-specific content
5. Checksum - 4 bytes

End of file: a transaction identifier

Note: before the distributed journal existed, an INVALID_TXID marker was appended to the end of the log on every flush and overwritten by the next flush; the current version has removed this marker.
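
To make the record layout concrete, here is a minimal, hypothetical parsing sketch; the EditRecord class and its fields are made up for this illustration (the real Hadoop reader also handles layout versioning), and the length field is read here as the payload length:

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical record holder mirroring the layout described above.
class EditRecord {
    byte opCode;      // 1 byte: operation type
    int length;       // 4 bytes: length of the operation payload (in this sketch)
    long txid;        // 8 bytes: transaction ID
    byte[] payload;   // operation-specific content
    int checksum;     // 4 bytes: checksum over the record

    // Read one record from an already-positioned stream (simplified sketch).
    static EditRecord read(DataInputStream in) throws IOException {
        EditRecord r = new EditRecord();
        r.opCode = in.readByte();
        r.length = in.readInt();
        r.txid = in.readLong();
        r.payload = new byte[r.length];
        in.readFully(r.payload);
        r.checksum = in.readInt();
        return r;
    }
}
```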

The reliability of a stand-alone system can be achieved with the editlog alone, but in a distributed environment, making the NameNode highly available requires at least two NameNodes. To achieve high availability and reliability, the first requirement is that the HDFS operation log (EditLog) has replicas. Replicas, however, introduce a new problem: keeping multiple replicas consistent, which every distributed storage system must solve. Cloudera developed QJM (Quorum Journal Manager) for exactly this purpose.

Journal Node cluster

The JournalNode is designed around the Paxos idea: a write is considered successful as soon as more than half of the nodes acknowledge it. The JournalNodes are therefore deployed as a cluster of at least three. The core idea is to write asynchronously to multiple JournalNodes and treat a quorum (more than half) of acknowledgements as success.
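
A minimal sketch of this quorum-write idea, under the assumption of a hypothetical JournalClient interface (the real QJM uses one AsyncLogger per JournalNode plus a quorum-call helper; this is not that code):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical asynchronous client for one JournalNode.
interface JournalClient {
    CompletableFuture<Void> sendEdits(long firstTxId, byte[] records);
}

class QuorumWriter {
    private final List<JournalClient> journals;

    QuorumWriter(List<JournalClient> journals) {
        this.journals = journals;
    }

    // Succeeds once a majority of JournalNodes have acknowledged the batch.
    void writeEdits(long firstTxId, byte[] records) throws Exception {
        int majority = journals.size() / 2 + 1;
        CompletableFuture<Void> quorumReached = new CompletableFuture<>();
        AtomicInteger acks = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();

        for (JournalClient j : journals) {
            j.sendEdits(firstTxId, records).whenComplete((v, err) -> {
                if (err == null) {
                    if (acks.incrementAndGet() >= majority) {
                        quorumReached.complete(null);
                    }
                } else if (failures.incrementAndGet() > journals.size() - majority) {
                    quorumReached.completeExceptionally(
                        new Exception("too many JournalNodes failed", err));
                }
            });
        }
        quorumReached.get(); // block the caller until a quorum acknowledges
    }
}
```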

Logging process

The process of writing editlog to multiple nodes is briefly described as follows:

  • The active NameNode writes log entries to the JournalNodes over a long-lived RPC connection
  • The standby NameNode synchronizes finalized logs from the JournalNodes over HTTP, replays them, and generates image files

The active NameNode writes a log entry for every transaction request it handles. There are many good articles online analyzing this write path, so here we only summarize the points worth learning and some good design ideas.

1 Batched disk flushing

Batching is a common practice for log writing. Flushing every single entry to disk is very inefficient; flushing in batches merges many small IOs (similar to MySQL group commit).

2 Double buffer switching

  • bufCurrent: the buffer that log entries are written into
  • bufReady: the buffer being flushed to disk

Without double buffering, once the log buffer is full we would have to force a flush before accepting more writes. Flushing does not just copy data into the operating system's kernel buffer; it also forces the data to the disk device, which is very time-consuming. With double buffering, the flush and new log writes can proceed concurrently, which greatly improves NameNode throughput.
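
A minimal sketch of the double-buffer swap; the class and method names are illustrative, not Hadoop's actual EditsDoubleBuffer, and locking is simplified (the real code also ensures only one flush is in progress at a time):

```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class DoubleBuffer {
    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives new edits
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // being flushed

    // Writers append serialized edits to bufCurrent; cheap, in-memory.
    synchronized void writeOp(byte[] serializedOp) {
        bufCurrent.write(serializedOp, 0, serializedOp.length);
    }

    // Swap the buffers so the flusher can work on bufReady
    // while writers keep filling the now-empty bufCurrent.
    synchronized void setReadyToFlush() {
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
        bufCurrent.reset();
    }

    // Expensive part: push bufReady to the device; runs concurrently with writeOp().
    void flushTo(FileOutputStream out) throws IOException {
        bufReady.writeTo(out);
        out.getFD().sync(); // force data to the disk device
        bufReady.reset();
    }
}
```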

Data recovery

Data recovery handles the aftermath of an active NameNode crash. When the standby NameNode takes over and becomes active, the first thing it must do is repair the inconsistency the crash left behind: the editlog data on the JournalNodes may differ from node to node. Before the standby node can officially announce that it is ready to work, it must make the JournalNode cluster's data consistent again. The recovery algorithm is officially described as being based on Multi-Paxos; it is analyzed below.

Multi Paxos

Paxos is the most complex protocol in distributed systems. The material online is mostly about concepts and theory and much less about practice, which is partly why this article was written: to understand Paxos better through a real implementation. There is plenty of Paxos material online; the slides recently shared by Denbo are easy to follow.
Multi-Paxos is an improved version of Paxos. Basic Paxos generates a new proposal in every round and is usually multi-writer, a bit like a ZooKeeper leader election where anyone can initiate an election. Most of our distributed systems, however, have a single leader that initiates all proposals, so the first proposal number can be reused and later proposals can go straight to the accept phase. In QJM's practice this resembles Raft: there is a leader role, and the current proposal number (epoch) is reused.

Data recovery process:

1 Isolation
2 Choose a recovery source
3 Recovery

1 Isolation

Before recovery can start, the new active node must fence its predecessor, to prevent a sudden resurrection from causing split brain. The fencing mechanism is newEpoch: generate a new epoch by taking the largest epoch across all JournalNodes, adding 1, and then instructing the JournalNode cluster to update to it. After the update, if the predecessor comes back to life, it can no longer write to the JournalNode cluster: its epoch is smaller than the cluster's, so it is rejected.

The new epoch is generated as follows (the original post shows the corresponding source as a screenshot):
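
As a hedged sketch of the idea (not the actual Hadoop source; the JournalNodeClient interface is hypothetical, and quorum and error handling are omitted):

```java
import java.util.List;

// Hypothetical per-node client; the two calls model the JournalNode RPCs
// described above: query the current promised epoch, then promise a new one.
interface JournalNodeClient {
    long getLastPromisedEpoch();
    void newEpoch(long epoch);
}

class EpochFencer {
    // Compute max(lastPromisedEpoch) over the cluster, add 1, and
    // ask every JournalNode to promise the new epoch.
    static long createNewUniqueEpoch(List<JournalNodeClient> journals) {
        long maxPromised = 0;
        for (JournalNodeClient jn : journals) {
            maxPromised = Math.max(maxPromised, jn.getLastPromisedEpoch());
        }
        long myEpoch = maxPromised + 1;
        for (JournalNodeClient jn : journals) {
            jn.newEpoch(myEpoch);
        }
        return myEpoch;
    }
}
```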

The rejection on the JournalNode side looks like this (again shown as a source screenshot in the original post):
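
A hedged sketch of that rejection; lastPromisedEpoch stands for the value the JournalNode persisted when it accepted the newEpoch request:

```java
import java.io.IOException;

class JournalEpochCheck {
    private long lastPromisedEpoch; // highest epoch this JournalNode has promised

    // Every request from a writer carries its epoch; stale writers are rejected.
    void checkWriterEpoch(long requestEpoch) throws IOException {
        if (requestEpoch < lastPromisedEpoch) {
            throw new IOException("epoch " + requestEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
    }
}
```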

2 Choose a recovery source

After fencing succeeds, a replica must be chosen to recover from. The latest segment file on each JournalNode can differ, because the NameNode's crash interrupted each node at a different point, so the client first needs information about the latest segment from the JournalNode cluster.

3 Recovery

After fencing succeeds, recovery begins. The classic algorithm for making the nodes of a distributed system agree on a value is Paxos, which runs in two phases; QJM's two phases are PrepareRecovery and AcceptRecovery. Note that this is Multi-Paxos rather than Basic Paxos: the difference is that the epoch is reused, but the core algorithm is still Paxos.

3.1 PrepareRecovery

The client sends a proposal to all JournalNodes for the segment being recovered, and each JournalNode returns the following information about that segment:

  1. Whether the segment exists
  2. If it exists, the state of the segment
  3. committedTxnId: the highest transaction ID the JournalNode knows to be committed. QJM updates the committedTxnId of each AsyncLogger after every log sync, and each JournalNode checks the committedTxnId carried in each request and updates its local value if the incoming one is larger
  4. lastWriterEpoch: the epoch recorded with the latest log file; it is recorded or updated every time a new segment is started, i.e. on the startLogSegment RPC call
  5. acceptedInEpoch: the proposal epoch accepted in a previous, unfinished recovery, persisted during the accept phase. When is acceptedInEpoch greater than lastWriterEpoch? Suppose the previous writer's epoch was 1, so lastWriterEpoch is 1, and the current recovery uses epoch 2 (from newEpoch). If the accept succeeded but the active NameNode crashed again while sending finalize, a JournalNode that never received the finalize request has acceptedInEpoch = 2 while lastWriterEpoch stays 1, because no new startLogSegment happened. In that case the next recovery round should restore the segment corresponding to acceptedInEpoch. This is essentially the fault-tolerance trick two-phase commit (2PC) uses to stay consistent when a failure happens during the commit phase (the shape of this response is sketched after this list)
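
To make the shape of this response concrete, here is a hypothetical value class holding the fields above; the real response is a protobuf message in Hadoop, so this is only an illustration:

```java
// Hypothetical holder for a JournalNode's PrepareRecovery response.
class PrepareRecoveryResponse {
    // Description of the last segment, or null if the node has no such segment.
    SegmentState segmentState;
    long committedTxnId;   // highest txid known committed on this node
    long lastWriterEpoch;  // epoch recorded by the last startLogSegment
    long acceptedInEpoch;  // epoch of a previously accepted but unfinalized recovery, 0 if none
}

// Minimal description of one log segment.
class SegmentState {
    long startTxId;       // first transaction in the segment
    long endTxId;         // last transaction currently present in the segment
    boolean inProgress;   // true for edits_inprogress_*, false once finalized
}
```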

3.2 AcceptRecovery

Based on the PrepareRecovery results, the client chooses one segment according to the selection algorithm and sends an accept request to all JournalNodes, telling them to agree on that segment. How the choice is made is analyzed below.

PrepareRecovery corresponds to the first phase of Paxos, and AcceptRecovery to the second.

Before analyzing the concrete two-phase implementation, the overall flow (shown as a diagram in the original post) is summarized as follows:

  • Prepare Recovery
    • PrepareRecoverRequest
    • prepareResponse
    • checkRequest, and select a segment as the synchronization source
  • Accept Recovery
    • The client initiates AcceptRecovery
    • The Journal accepts the AcceptRecovery request
    • After accepting the request, check whether the segment contains transactions
    • After accepting the request, check whether the last Paxos round completed normally; this check determines whether data needs to be synchronized
  • Commit

The main behavior of each step is analyzed below:

PrepareRecoverRequest(P1a)

In the first phase, the client initiates a proposal to every JournalNode.

On the server side, the Journal builds its prepareResponse (P1b) from the information described in section 3.1 (segment state, committedTxnId, lastWriterEpoch, acceptedInEpoch).
checkRequest

The proposal is initiated with newEpoch; the JournalNode accepts it through checkRequest, which validates the proposal number (epoch) and performs the corresponding action.

Select a segment as the synchronization source

After the prepare phase completes, that is, once more than half of the nodes have responded, the most suitable replica must be chosen from the returned log segments. The selection rules are as follows:

  1. Prefer a response that actually has the segment, because some JournalNodes may not contain the corresponding segment at all
  2. If both responses have the segment, check their startTxid; if they differ, something is logically wrong and an exception is thrown
  3. If both have the segment, compare their states: Finalized takes precedence over InProgress, because a finalized segment represents the settled, final content
  4. If both segments are finalized, check that their lengths are the same; a mismatch is abnormal, because finalized segments no longer change and their lengths should match. If they match, choose either one
  5. Compare the epochs; if they differ, choose the larger one. Here pay special attention to the comparison of acceptedInEpoch and lastWriterEpoch discussed above
  6. If the epochs are equal, compare the segment lengths and choose the longer one (see the sketch after this list)
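
A hedged sketch of these rules as a comparator over the hypothetical PrepareRecoveryResponse shown earlier; the real selection logic lives inside QJM, and this is only an approximation of it:

```java
import java.util.Comparator;

// Picks the "better" recovery source of two PrepareRecovery responses,
// following rules 1-6 above. A positive result means a is preferred.
class RecoverySourceComparator implements Comparator<PrepareRecoveryResponse> {
    @Override
    public int compare(PrepareRecoveryResponse a, PrepareRecoveryResponse b) {
        // Rule 1: a response with a segment beats one without.
        if (a.segmentState == null || b.segmentState == null) {
            return Boolean.compare(a.segmentState != null, b.segmentState != null);
        }
        // Rule 2: segments being recovered must start at the same txid.
        if (a.segmentState.startTxId != b.segmentState.startTxId) {
            throw new IllegalStateException("segments have different startTxId");
        }
        boolean aFinalized = !a.segmentState.inProgress;
        boolean bFinalized = !b.segmentState.inProgress;
        // Rule 3: finalized beats in-progress.
        if (aFinalized != bFinalized) {
            return Boolean.compare(aFinalized, bFinalized);
        }
        // Rule 4: two finalized segments must have the same length; either works.
        if (aFinalized) {
            if (a.segmentState.endTxId != b.segmentState.endTxId) {
                throw new IllegalStateException("finalized segments differ in length");
            }
            return 0;
        }
        // Rules 5 and 6: both in progress; prefer the higher epoch, then the
        // longer segment. acceptedInEpoch matters for the reason given in 3.1.
        long aEpoch = Math.max(a.lastWriterEpoch, a.acceptedInEpoch);
        long bEpoch = Math.max(b.lastWriterEpoch, b.acceptedInEpoch);
        if (aEpoch != bEpoch) {
            return Long.compare(aEpoch, bEpoch);
        }
        return Long.compare(a.segmentState.endTxId, b.segmentState.endTxId);
    }
}
```
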
Client initiates AcceptRecovery (P2a)

The first phase concludes by choosing one value from the proposal responses to serve as the proposal for the accept request; the selection algorithm was described above. The accept request is then issued.

Journal accepts AcceptRecovery request (P2b)

In the accept phase, the proposal number (epoch) must be checked again, because the JournalNode made its promise during the prepare phase and only accepts proposals consistent with that promise.


1 Check whether the segment contains transactions after accepting the request


2 After accepting the request, check whether the last paxos was completed normally. The check here is to determine whether the data needs to be synchronized

Check whether there is data left over from a recovery that did not finish, that is, the last round of Paxos failed and a new recovery has been started. The check determines whether the previous Paxos instance completed and exited correctly; if it did not, the proposal numbers must be compared: if the epoch of this accept is smaller than the epoch of the previous Paxos round, that is an error.

currentSegment is the local log segment on this JournalNode. There are two cases in which data must be synchronized from another JournalNode:

1. currentSegment is null: the active NameNode crashed before this JournalNode received any log for the new segment
2. The file exists, but the length of the local segment differs from the length of the segment to be recovered (a sketch of this decision follows)
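
A hedged sketch of that decision; currentSegment, the SegmentState class from the earlier sketch, and downloadSegmentFrom are illustrative names rather than the actual Hadoop methods:

```java
import java.net.URL;

class AcceptRecoveryHandler {
    private SegmentState currentSegment; // this JournalNode's local copy, may be null

    // Decide whether the local segment must be replaced by the chosen recovery source.
    void maybeSyncFrom(SegmentState segmentToRecover, URL syncSourceUrl) {
        boolean needsSync =
            currentSegment == null                                   // case 1: nothing local
            || currentSegment.endTxId != segmentToRecover.endTxId;   // case 2: lengths differ
        if (needsSync) {
            downloadSegmentFrom(syncSourceUrl, segmentToRecover);    // fetch over HTTP
            currentSegment = segmentToRecover;
        }
    }

    private void downloadSegmentFrom(URL url, SegmentState segment) {
        // Illustrative placeholder: stream the segment file from the chosen
        // JournalNode and replace the local edits_inprogress file.
    }
}
```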

On the client side, once more than half of the JournalNodes return success for the recovery, the client proceeds to finalize.

After accept succeeds comes the third step, the commit, which here is the finalize operation: the file is renamed so that the NameNode can read it.

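A minimal sketch of that rename, using the file-name convention from the "File format" section; the directory layout, zero-padding, and method name are illustrative:

```java
import java.io.File;
import java.io.IOException;

class SegmentFinalizer {
    // Rename edits_inprogress_<firstTxid> to edits_<firstTxid>-<endTxid>,
    // marking the segment as finalized and readable by the NameNode.
    static void finalizeLogSegment(File editsDir, long firstTxId, long endTxId)
            throws IOException {
        File inProgress = new File(editsDir,
            String.format("edits_inprogress_%019d", firstTxId));
        File finalized = new File(editsDir,
            String.format("edits_%019d-%019d", firstTxId, endTxId));
        if (!inProgress.renameTo(finalized)) {
            throw new IOException("failed to finalize " + inProgress);
        }
    }
}
```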

Journal Node failure

In a distributed log system, besides handling the normal path, the more important question is how to tolerate failures. QJM's core rule is the quorum: if more than half of the JournalNodes are down, writes are refused outright; the failure of a single node out of three, however, can be tolerated.

When one JournalNode crashes, QJM stops sending the log stream to the failed node and marks it outOfSync = true. When does it start sending data to that node again? When a new log file is started, that is, on the startLogSegment RPC: after that request succeeds, QJM checks whether the node's outOfSync flag is true and, if so, resets it to false and lets the node start accepting logs again. If a node fails only temporarily while a segment is being written, for example a brief network disconnection that later heals, QJM does not resume writing immediately; until a new log file is started it only sends the failed node heartbeats carrying the current transaction ID (txid). Think about what would happen if writes resumed immediately: at the very least there would be a gap in the transactions, because the entries written during the failure were never sent to that node.
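
A hedged sketch of this per-node bookkeeping on the writer side; the field and method names are illustrative rather than the actual AsyncLogger API:

```java
// Per-JournalNode state kept by the writer (illustrative, not the Hadoop AsyncLogger).
class LoggerState {
    boolean outOfSync = false; // true once a send to this node has failed mid-segment

    // Called for every batch of edits within the current segment.
    void sendEdits(long firstTxId, byte[] records) {
        if (outOfSync) {
            // Don't send edits; just heartbeat with the current txid so the
            // node knows how far the writer has progressed.
            sendHeartbeat(firstTxId);
            return;
        }
        try {
            doSendEdits(firstTxId, records);
        } catch (Exception e) {
            // Skip this node until the next segment starts, to avoid a txid gap.
            outOfSync = true;
        }
    }

    // Called when startLogSegment has succeeded on this node.
    void onStartLogSegment() {
        outOfSync = false; // safe to resume: the new segment starts from a clean boundary
    }

    private void doSendEdits(long firstTxId, byte[] records) throws Exception { /* RPC */ }
    private void sendHeartbeat(long txid) { /* RPC */ }
}
```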

About the author

Peng Rongxin is an architect at Shanghai Oudian Cloud Information Technology Co., Ltd. He is personally interested in underlying technologies such as distributed storage and concurrency, and keeps learning about them.

Related Reading

  • Paxos principles, history and actual combat that architects need to understand
  • Cases|S3, Cassandra, HDFS design hidden high-availability rules
  • Is it really low with ZooKeeper? Discussion on configuration service plans for thousands of node scenarios-High-availability architecture series

