Detailed explanation of Namenode HA principle (split brain)

The community's Hadoop 2.2.0 release began to support NameNode HA. This article describes the internal design and implementation of NameNode HA in detail.

 

Why Namenode HA?

1. NameNode HA stands for High Availability: the goal is to keep the NameNode service available even when a single NameNode fails.

2. The NameNode is critical. If it goes down, storage stops serving: data can neither be read nor written, and computations that depend on it (MR, Hive, etc.) cannot complete.

 

How is Namenode HA implemented and what are the key technical challenges?

1. How to keep the state of the active and standby NameNodes synchronized, so that the Standby can provide service quickly after the Active goes down. Starting a NameNode takes time: it must load the fsimage and replay the editlog (to build the file-to-block mapping) and process the first block report from every DataNode (to build the block-to-DataNode mapping). To keep the NN state synchronized, both kinds of information must be kept in sync.

2. Split brain: in a high-availability (HA) system, when the two interconnected nodes lose contact with each other, the system splits into two independent nodes, each of which starts competing for the shared resources, throwing the system into confusion and corrupting data.

3. NameNode failover must be transparent to the outside world. When the primary NameNode is switched to another machine, connected clients must not fail; this mainly concerns the connections from clients and DataNodes to the NameNode.

Community NN HA: its architecture, implementation principles, the mechanism of each part, and the problems each part solves

1. In the non-HA NameNode architecture, an HDFS cluster has only one NN, the DNs report to that single NN, and the NN's editlog is stored in a local directory.

2. Architecture of Community NN HA

Figure 1, NN HA architecture (copied from the community)

    The community's NN HA consists of two NNs (active and standby), ZKFC, ZK, and a shared editlog. The flow: after the cluster starts, one NN is in the active state and provides service, handling requests from clients and DataNodes and writing the editlog both locally and to the shared editlog storage (which can be NFS, QJM, etc.). The other NN is in the Standby state: it loads the fsimage at startup and then periodically fetches the editlog from the shared storage to stay in sync with the active. So that the standby can provide service quickly after the active fails, the DNs report to both NNs at the same time; the Standby thus also holds the block-to-DataNode mapping, which matters because processing the block reports of all DataNodes is the most time-consuming part of NN startup. To make the hot standby automatic, a FailoverController and ZK are added: the FailoverController communicates with ZK, the master is elected through ZK, and the FailoverController switches the NN to active or standby via RPC.

 

3. Key problems:

(1) Keeping the NN state synchronized: the standby periodically fetches the editlog, and the DNs also send block reports to the standby.

(2) Prevention of split brain

  Fencing of the shared storage: ensure that only one NN can write to it successfully. QJM is used to implement this fencing; the principle is described below.

  Fencing of the DataNodes: ensure that only one NN can command the DNs. How the DNs implement fencing is described in detail in HDFS-1972; a minimal sketch of the idea follows the list below.

     (a) Whenever an NN changes state, it sends its new state and a sequence number to the DNs.

     (b) The DNs keep track of this sequence number during operation. During a failover, the new NN returns its active state and a larger sequence number in its heartbeat responses; when a DN receives such a response, it regards that NN as the new active.

     (c) If the original active then recovers (for example from a long GC pause) and its heartbeat responses still carry the active state with the old sequence number, the DNs reject that NN's commands.

     (d) One point deserves special attention: the scheme above is still not complete. HDFS-1972 also fixes some hazards that could lead to blocks being deleted by mistake: after a failover, the active must not delete any block before the DNs have reported all of their pending deletion reports.
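
   A minimal Java sketch of the idea behind (a)-(c) above; the class and member names are hypothetical, not the actual DataNode code:

    // Sketch of the DN-side fencing rule from HDFS-1972 (hypothetical names).
    public class StaleNameNodeGuard {
        // Highest "active" sequence number this DN has seen so far.
        private long highestSeenSerial = Long.MIN_VALUE;

        // Called when a heartbeat response arrives from an NN claiming to be
        // active; returns true if the DN should obey this NN's commands.
        public synchronized boolean acceptActive(long serial) {
            if (serial >= highestSeenSerial) {
                highestSeenSerial = serial;  // new or current active
                return true;
            }
            // An old active woke up (e.g. after a GC pause) and still uses
            // its smaller sequence number: reject its commands (rule (c)).
            return false;
        }
    }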

   Client fencing: ensure that only one NN can respond to client requests, and let clients that reach the standby NN fail immediately. A layer is wrapped around RPC: the client connects to the NN through a FailoverProxyProvider, retrying on failure. After a number of failed attempts against one NN, it tries the other NN; the only impact on the client is some added latency while retrying. The retry count and interval are configurable. A retry sketch follows.
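
   A minimal sketch of the client-side retry idea; the class below is made up for illustration and is much simpler than Hadoop's real FailoverProxyProvider machinery:

    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.Callable;

    // Hypothetical sketch: retry against one NN, then fail over to the next.
    public class FailoverRetryClient {
        private final List<String> namenodes;  // e.g. [nn1-addr, nn2-addr]
        private final int retriesPerNN;
        private int current = 0;               // NN we currently talk to

        public FailoverRetryClient(List<String> namenodes, int retriesPerNN) {
            this.namenodes = namenodes;
            this.retriesPerNN = retriesPerNN;
        }

        // Runs one RPC; on repeated failure switches to the other NN, which
        // is why clients only see some extra latency during a failover.
        public <T> T call(Callable<T> rpc) throws Exception {
            Exception last = null;
            for (int i = 0; i < namenodes.size(); i++) {
                for (int attempt = 0; attempt < retriesPerNN; attempt++) {
                    try {
                        return rpc.call();  // directed at namenodes.get(current)
                    } catch (IOException e) {
                        last = e;           // standby or unreachable NN
                    }
                }
                current = (current + 1) % namenodes.size();  // fail over
            }
            throw last;
        }
    }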

Design of ZKFC

1. The FailoverController implements the following functions (an election sketch follows the list):

  (a) Monitor the health of the NN.

  (b) Send periodic heartbeats to ZK so that it can be elected.

  (c) When ZK elects it as the master, the active FailoverController makes the corresponding NN transition to active via an RPC call.
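
  Steps (b) and (c) are built on ZooKeeper ephemeral nodes: the heartbeat is effectively the ZK session keep-alive, and whoever creates the lock znode becomes the master. A minimal sketch using the plain ZooKeeper client API (the lock path is invented here; the real ActiveStandbyElector adds watches, retries, and fencing data):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical leader-election sketch with a ZK ephemeral node.
    public class LeaderElectionSketch {
        private static final String LOCK = "/ha/active-lock"; // made-up path

        public static boolean tryBecomeActive(ZooKeeper zk, String myId)
                throws KeeperException, InterruptedException {
            try {
                // Ephemeral: the znode disappears automatically when our ZK
                // session dies, so a crashed FC releases the lock by itself.
                zk.create(LOCK, myId.getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;   // we won: ask our NN to become active over RPC
            } catch (KeeperException.NodeExistsException e) {
                return false;  // another FC holds the lock: stay standby
            }
        }
    }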

2. Why it runs as a separate daemon process, apart from the NN

  (1) To prevent the heartbeats from being affected by an NN GC pause or failure.

  (2) The failover-controller code should be separated from the application code, which improves fault tolerance.

  (3) It makes active/standby election a pluggable component.

Figure 2, FailoverController architecture (copied from the community)

3. The FailoverController consists of three main components (wired together as sketched after this list):

  (1) HealthMonitor: monitors whether the NameNode is unavailable or unhealthy; currently it does so by calling the corresponding NN method over RPC.

  (2) ActiveStandbyElector: manages and monitors its own state in ZK.

  (3) ZKFailoverController: subscribes to the events of the HealthMonitor and the ActiveStandbyElector and manages the NameNode's state.
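
  A rough sketch of how the three components might be wired together; the callback names and state enum are hypothetical simplifications of the real classes:

    // Hypothetical simplification of the ZKFC component wiring.
    enum HealthState { HEALTHY, UNHEALTHY }

    interface HealthMonitor {                   // polls the NN over RPC
        interface Callback { void healthChanged(HealthState s); }
    }

    interface ActiveStandbyElector {            // manages our znode in ZK
        void joinElection();
        void quitElection();
        interface Callback {
            void becomeActive();                // we now hold the ZK lock
            void becomeStandby();               // someone else holds it
        }
    }

    class ZkFailoverControllerSketch
            implements HealthMonitor.Callback, ActiveStandbyElector.Callback {
        private final ActiveStandbyElector elector;

        ZkFailoverControllerSketch(ActiveStandbyElector elector) {
            this.elector = elector;
        }

        // HealthMonitor event: only a healthy NN may compete for the lock.
        public void healthChanged(HealthState s) {
            if (s == HealthState.HEALTHY) elector.joinElection();
            else elector.quitElection();        // release lock -> failover
        }

        // Elector events: drive the local NN's state over RPC.
        public void becomeActive()  { /* RPC: transition NN to active */ }
        public void becomeStandby() { /* RPC: transition NN to standby */ }
    }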

 

Design of QJM

  1. The NameNode keeps HDFS metadata such as directories and files. For every client operation that creates, deletes, or modifies a file, the NameNode records a log entry, called the editlog, while the metadata itself is stored in the fsimage. To keep the Standby consistent with the active, the standby needs to obtain each editlog entry as close to real time as possible and apply it to its FsImage. This calls for a shared storage that holds the editlog and from which the standby can read in near real time. Two key points must be guaranteed: the shared storage must be highly available, and the two NameNodes must be prevented from writing to it simultaneously and corrupting the data.
  2. What it is: the Quorum Journal Manager, based on Paxos (a message-passing consensus algorithm). The algorithm is rather hard to grasp; put simply, Paxos solves how nodes in a distributed environment reach agreement on a value. (A typical scenario: in a distributed database system, if all nodes start in the same initial state and execute the same sequence of operations, they will end in the same consistent state. To guarantee that every node executes the same command sequence, a consensus algorithm is run for each command so that every node sees the same commands.)

     

    Figure 3, QJM architecture

  3. How it is implemented:

    (1) After initialization, the Active writes each editlog entry to the 2N+1 JNs. Every editlog entry has a number, and a write is considered successful as soon as a majority of the JNs (i.e., at least N+1) return success (see the write-path sketch after this list).

    (2) The Standby periodically reads a batch of editlog entries from the JNs and applies them to its in-memory FsImage.

    (3) How fencing works: every time the NameNode writes the editlog, it passes an epoch number to the JNs. A JN compares the epoch against the one it has stored: if the incoming epoch is greater than or equal to its own, the write is allowed and the JN updates its stored epoch to the newer value; otherwise the operation is rejected. On a switchover, when the Standby transitions to Active it increments the epoch by 1, so even if the previous NameNode still tries to write its log to the JNs, the write fails.

    (4) Writing the log:

      (a) The NN writes the editlog to the N JNs asynchronously over RPC; the write succeeds once N/2+1 of them report success.

      (b) A JN whose write failed is skipped for subsequent writes until the roll-log operation is invoked; if the JN has recovered by then, writing to it resumes.

      (c) Every editlog entry has a transaction id (txid). The NN must write txids contiguously; when a JN receives a write, it checks that the txid is contiguous with the previous one, and otherwise fails the write.

    (5) Reading the log:

      (a) Periodically poll all JNs for the editlog entries that have not yet been consumed, sorted by txid.

      (b) Consume the editlog entries in txid order.

    (6) Log recovery on switchover

      (a) Triggered by an active/standby switchover.

      (b) Prepare recovery (prepareRecovery): the standby sends an RPC request to the JNs to obtain their txid information and picks the best JN from the responses.

      (c) Accept recovery (acceptRecovery): the standby sends an RPC to the JNs, and the JNs synchronize the editlog among themselves.

      (d) Finalize the log: the same operation as when the current editlog output stream is closed or the log is rolled.

      (e) The Standby catches up on the editlog to the latest state.

    (7) How the best JN is chosen (see the comparator sketch below):

      (a) A JN with a finalized segment is preferred over one with only an in-progress segment.

      (b) With several finalized segments, check whether their txids are equal.

      (c) With no finalized segment, first look at whose epoch is larger.

      (d) If the epochs are equal, choose the one with the larger txid.
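
    To make (1), (3), and (4c) concrete, here is a minimal Java sketch of both sides of the write path: the epoch/txid checks on the JN and the majority count on the NN side. All class and method names are hypothetical; the real QJM is asynchronous and far more careful:

    import java.util.List;

    // Hypothetical sketch of the QJM write path (not the real classes).
    class JournalNodeSketch {
        private long promisedEpoch = 0;  // highest epoch this JN has seen
        private long lastTxId = 0;       // txid of the last accepted entry

        // Accept an edit with id `txid`, written under `epoch`.
        synchronized boolean journal(long epoch, long txid) {
            if (epoch < promisedEpoch) {
                return false;            // fenced: a newer active exists (3)
            }
            promisedEpoch = epoch;       // equal or newer epoch may write
            if (txid != lastTxId + 1) {
                return false;            // txids must be contiguous (4c)
            }
            lastTxId = txid;
            return true;
        }
    }

    class QuorumWriterSketch {
        // A write succeeds once a majority of the 2N+1 JNs ack it (1).
        static boolean quorumWrite(List<JournalNodeSketch> jns,
                                   long epoch, long txid) {
            int acks = 0;
            for (JournalNodeSketch jn : jns) {
                if (jn.journal(epoch, txid)) acks++;
            }
            return acks >= jns.size() / 2 + 1;   // e.g. 2 of 3, 3 of 5
        }
    }

    On a failover, the new active would first raise the epoch on a majority of the JNs; from then on, writes from the old active, which still carry the smaller epoch, are rejected.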
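
    And a comparator expressing the selection rules (a)-(d); the LogSegment fields are hypothetical stand-ins for the per-JN segment state:

    import java.util.Comparator;

    // Hypothetical per-JN segment description used during recovery.
    class LogSegment {
        boolean finalized;   // finalized vs still in progress
        long epoch;          // epoch of the writer of this segment
        long endTxId;        // last txid contained in the segment
    }

    // Orders candidates so that the best segment sorts highest.
    class BestSegmentComparator implements Comparator<LogSegment> {
        public int compare(LogSegment a, LogSegment b) {
            // (a) a finalized segment beats an in-progress one
            if (a.finalized != b.finalized) {
                return a.finalized ? 1 : -1;
            }
            // (b) two finalized segments are expected to agree on txid;
            // (c) for in-progress segments, the larger epoch wins first,
            if (!a.finalized && a.epoch != b.epoch) {
                return Long.compare(a.epoch, b.epoch);
            }
            // (d) and on equal epochs, the larger txid wins.
            return Long.compare(a.endTxId, b.endTxId);
        }
    }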

     

    References:

    1. https://issues.apache.org/jira/secure/attachment/12480489/NameNode%20HA_v2_1.pdf

    2. https://issues.apache.org/jira/secure/attachment/12521279/zkfc-design.pdf

    3. https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf

    4. https://issues.apache.org/jira/browse/HDFS-1972

    5. https://issues.apache.org/jira/secure/attachment/12490290/DualBlockReports.pdf

    6. http://svn.apache.org/viewvc/Hadoop/common/branches/branch-2.2.0/

    7. http://yanbohappy.sinaapp.com/?p=205
