NameNode HA

 

How is NameNode HA implemented in HDFS? (in a nutshell)
The Hadoop cluster is configured with two NameNode processes. One serves clients as the Active node, while the other stands by as the Standby node. When the two NameNodes start, each creates an ephemeral sequential node in ZooKeeper; the NameNode whose znode has the smallest sequence number becomes Active, and the other becomes Standby. If the Active node goes down, its ephemeral znode is removed from ZooKeeper, and the NameNode whose znode now has the smallest sequence number automatically becomes Active and takes over service. The HA cluster also needs some way to share data between the two NameNodes (in practice, the NameNode edits log).
There are two ways to implement NameNode HA: the first is JournalNode + ZooKeeper, the other is NFS + ZooKeeper.
The difference between the two lies in how the data is synchronized between the two NameNodes: with NFS + ZooKeeper the NameNodes share edits through an NFS (Network File System) remote shared directory, whereas with JournalNode + ZooKeeper the edits are synchronized between the two NameNodes through a JournalNode cluster.
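To get a concrete feel for the Active/Standby roles, the hdfs haadmin tool can be used to query and switch them. A minimal sketch, assuming the NameNode IDs are nn1 and nn2 as in the configuration section later in this article:

# query which role each NameNode currently holds
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# manually hand the Active role from nn1 to nn2 (manual failover)
hdfs haadmin -failover nn1 nn2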

 

 

 

 

HA Overview

1) HA (High Availability) means the service is available around the clock (7 × 24 hours without interruption).

2) The key strategy for achieving high availability is to eliminate single points of failure. Strictly speaking, HA should be considered per component: HDFS HA and YARN HA.

3) Before Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.

4) The NameNode affects HDFS cluster availability in two main ways:

       If the NameNode machine fails unexpectedly, for example it crashes, the cluster is unavailable until an administrator restarts it.

       If the NameNode machine needs a software or hardware upgrade, the cluster is unavailable while it is down.

HDFS HA solves these problems by configuring two NameNodes in the cluster in an Active/Standby (hot standby) arrangement. If a machine crashes, or a machine needs to be taken down for upgrade or maintenance, the Active role can be switched quickly to the NameNode on the other machine.

 HDFS-HA mechanism

The single point of failure is eliminated by running two NameNodes.

Key points of HDFS-HA

1. Metadata management has to change

Each NameNode keeps a copy of the metadata in memory;

Only the NameNode in the Active state may write to the edits log;

Both NameNodes can read the edits;

The edits are kept on shared storage (qjournal and NFS are the two mainstream implementations);

2. A state-management module is needed

A zkfailover process runs permanently on every node that hosts a NameNode. Each zkfailover monitors the NameNode on its own node and uses zk to record its state. When the state needs to be switched, the zkfailover performs the switch, and split-brain must be prevented during the switchover.

3. Passwordless SSH must be possible between the two NameNodes (a setup sketch follows this list)

4. Fencing: at any moment only one NameNode provides service
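A minimal sketch of setting up the passwordless SSH from point 3, assuming the two NameNode hosts are machine1.example.com and machine2.example.com (the hostnames reused from the configuration examples below) and that the commands are run as the user that starts the NameNode:

# on machine1: generate a key pair (if one does not exist) and copy it to the peer
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id machine2.example.com

# repeat in the other direction on machine2
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id machine1.example.com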

HDFS-HA automatic failover mechanism

The command hdfs haadmin -failover introduced earlier performs a manual failover. In that mode, even if the active NameNode has failed, the system does not automatically transfer the active role to the standby NameNode. The following explains how to configure and deploy automatic failover for HA. Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum and the ZKFailoverController (ZKFC) process, as shown in Figure 3-20. ZooKeeper maintains a small amount of coordination data, notifies clients when that data changes, and monitors clients for failure. Automatic failover in HA relies on the following ZooKeeper features:

1) Failure detection: each NameNode in the cluster maintains a persistent session with ZooKeeper. If the machine crashes, the session is terminated and ZooKeeper notifies the other NameNode that a failover needs to be triggered.

2) Active NameNode election: ZooKeeper provides a simple mechanism for electing a single node to be active. If the currently active NameNode crashes, another node can acquire a special exclusive lock in ZooKeeper to indicate that it should become the active NameNode (the znodes involved can be inspected as shown below).
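The coordination data lives under a znode in ZooKeeper and can be inspected with the ZooKeeper CLI. A sketch, assuming the default parent znode /hadoop-ha and the nameservice name mycluster used later in this article (znode names may differ slightly across Hadoop versions):

# from a host with the ZooKeeper client installed
zkCli.sh -server localhost:2181
ls /hadoop-ha/mycluster
# typically shows the ephemeral lock znode (ActiveStandbyElectorLock)
# and a persistent record of the last known active (ActiveBreadCrumb)
get /hadoop-ha/mycluster/ActiveStandbyElectorLock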

The other new component for automatic failover, ZKFC, is a ZooKeeper client that also monitors and manages the NameNode's state. Every host that runs a NameNode also runs a ZKFC process. ZKFC is responsible for:

1) Health monitoring: ZKFC periodically pings the NameNode on the same host with a health-check command. As long as the NameNode replies in time that it is healthy, ZKFC considers the node healthy. If the node crashes, freezes, or otherwise enters an unhealthy state, the health monitor marks it as unhealthy.

2) ZooKeeper session management: when the local NameNode is healthy, ZKFC keeps a session open in ZooKeeper. If the local NameNode is active, ZKFC also holds a special lock znode, which relies on ZooKeeper's support for ephemeral nodes: if the session terminates, the lock znode is deleted automatically.

3) ZooKeeper-based election: if the local NameNode is healthy and ZKFC sees that no other node currently holds the lock znode, it tries to acquire the lock itself. If it succeeds, it has won the election and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: the previously active NameNode is fenced first if necessary, and then the local NameNode transitions to the Active state.

In Hadoop 1.x, the NameNode was a single point of failure for the cluster: once the NameNode failed, the entire cluster was unavailable until the NameNode was restarted or a new one was brought up. It is worth noting that the Secondary NameNode does not provide failover capability. Cluster availability is affected in the following situations:

  • When the machine fails, for example due to a power outage, the administrator must restart the NameNode before the cluster becomes available again.
  • During routine maintenance and upgrades, the NameNode has to be stopped, which leaves the cluster unavailable for a period of time.

Architecture

Hadoop HA (High Availability) solves these problems by running two NameNodes at the same time in an Active/Passive configuration, referred to as the Active NameNode and the Standby NameNode. The Standby NameNode acts as a hot standby, allowing fast failover when a machine fails and graceful switching between NameNodes during routine maintenance. One active NameNode and one standby can be configured; no more than two NameNodes are supported.

The active NameNode handles all client requests (reads and writes), while the Standby acts only as a slave that keeps its state synchronized so that it can take over quickly when a failure occurs. To keep the Standby NameNode's data synchronized with the Active NameNode's, both NameNodes communicate with a group of JournalNodes. When the active NameNode performs a namespace operation, it makes sure the modification is durably written to a majority of the JournalNodes. The Standby NameNode continuously watches these edits and, when it detects changes, applies them to its own namespace.

During a failover, before the Standby becomes the active NameNode, it makes sure it has read all of the edit log stored on the JournalNodes, so that its state is consistent with the state before the failure occurred.

To ensure that failover completes quickly, the Standby NameNode also needs up-to-date block location information, that is, which nodes in the cluster hold replicas of each block. To achieve this, the DataNodes are configured with both NameNodes and send heartbeats and block reports to both of them.

It is very important that at any time only one NameNode is in the Active state; otherwise data loss or corruption can occur. If both NameNodes believed they were active, they would both try to write data at the same time (without detecting or synchronizing with each other). To prevent this split-brain scenario, the JournalNodes allow only one NameNode to write at a time; this is enforced internally with an epoch number, which makes failover safe.

There are two ways to share the edit log:

  • Use an NFS share for the edit log (stored on NAS/SAN)
  • Use QJM to share the edit log

Using NFS shared storage

 

 

As shown in the figure above, NFS serves as the shared storage between the active and standby NameNodes. This approach can suffer from split-brain: both nodes believe they are the primary NameNode and try to write to the edit log, which can corrupt the data. The problem is addressed by configuring a fencing script (a minimal hypothetical sketch follows the list below). The fencing script:

  • shuts down the previously active NameNode
  • prevents the previous NameNode from continuing to access the shared edit log
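A minimal, hypothetical sketch of what such a fence script might do; the argument convention and the kill strategy are assumptions for illustration, not an actual script shipped with Hadoop (Hadoop's built-in sshfence and shell(...) fencing methods cover the common cases):

#!/bin/bash
# fence-namenode.sh <target-host>   (hypothetical)
TARGET_HOST="$1"
# 1) shut the previous NameNode down so it stops serving
ssh "$TARGET_HOST" "pkill -9 -f org.apache.hadoop.hdfs.server.namenode.NameNode"
# 2) revoke its access to the shared edit log, e.g. by unexporting the NFS share
#    for that host (details depend on the NAS/NFS setup and are omitted here)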

With this scheme, the administrator can manually trigger a NameNode switchover and then carry out upgrades and maintenance. However, this approach has several problems:

  • Failover is manual only; every failure requires the administrator to step in and perform the switch.
  • NAS/SAN deployment and configuration are complex and error-prone, and the NAS itself is a single point of failure.
  • Fencing is very complex and is often misconfigured.
  • It cannot handle unplanned outages such as hardware or software failures.

Therefore another approach is needed that addresses these issues:

  • Automatic failover (introducing ZooKeeper to automate it)
  • No dependence on external hardware and software (NAS/SAN)
  • Handles both unplanned outages and the unavailability caused by routine maintenance

Quorum-based storage + ZooKeeper

QJM (Quorum Journal Manager) is a component developed in Hadoop specifically as shared storage for the NameNode. It runs a group of JournalNodes in the cluster; each JournalNode exposes a simple RPC interface that lets the NameNodes read and write data, which is stored on the JournalNode's local disk. When the NameNode writes the edit log, it sends the write request to every JournalNode in the group, and the write is considered successful once a majority of the nodes acknowledge it. For example, with 3 JournalNodes, the write is successful as soon as the NameNode receives acknowledgements from 2 of them.

For failover, a ZKFailoverController (ZKFC) is introduced to monitor the NameNode's state. The ZKFC typically runs on the same host as the NameNode and works together with the ZooKeeper cluster to perform automatic failover. The overall cluster architecture is as follows:

 

 

QJM

On the NameNode side (client), the NameNode uses the RPC interface provided by the QJM client to interact with the JournalNodes. Writes to the edit log are quorum-based: the data must be written to a majority of the nodes in the JournalNode cluster.
On the JournalNode side (server):
Each JournalNode runs a lightweight daemon that exposes an RPC interface for clients to call. The actual edit log data is stored on the JournalNode's local disk, at the path specified by the dfs.journalnode.edits.dir property in the configuration.
JournalNodes use an epoch number to solve the split-brain problem; this is called JournalNode fencing. It works as follows (a directory-listing sketch follows this list):
1) When a NameNode becomes Active, it is assigned an integer epoch number. This epoch number is unique and higher than the epoch numbers held by all previous NameNodes.

2) When the NameNode sends a request to a JournalNode, it includes its epoch number. When the JournalNode receives the request, it compares the received epoch with the promised epoch stored locally. If the received epoch is higher, the JournalNode updates its local promised epoch to the received value; if it is lower, the JournalNode rejects the request.

3) An edit log write only counts as successful if it is accepted by a majority of the JournalNodes, which means the writer's epoch must not be lower than the promised epoch on a majority of the nodes.
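The promised epoch is persisted on each JournalNode's local disk next to the edit segments, so it survives restarts. A sketch of what the journal directory typically contains, assuming the dfs.journalnode.edits.dir value used later in this article (exact file names can vary between Hadoop versions):

# on a JournalNode host
ls /path/to/journal/node/local/data/mycluster/current
#   edits_0000000000000000001-0000000000000000042   (finalized segments)
#   edits_inprogress_0000000000000000043            (segment being written)
#   last-promised-epoch                             (epoch this node has promised)
#   last-writer-epoch
#   VERSION
cat /path/to/journal/node/local/data/mycluster/current/last-promised-epoch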

 

 

This approach solves the three problems of the NFS approach:

  • No additional hardware is needed; the existing machines are used.
  • Fencing is controlled through the epoch number, which avoids misconfiguration.
  • Automatic failover: ZooKeeper takes care of this.

Automatic failover with ZooKeeper

As mentioned earlier, to support automatic failover Hadoop introduces two new components: a ZooKeeper quorum and the ZKFailoverController process (ZKFC for short).

ZooKeeper's responsibilities include:

  • Failure detection: each NameNode maintains a persistent session in ZooKeeper. If the NameNode fails, the session expires, and ZooKeeper's event mechanism notifies the other NameNode that a failover is needed.
  • NameNode election: if the current active NameNode goes down, the other NameNode tries to acquire an exclusive lock in ZooKeeper; obtaining the lock indicates that it will become the next active NameNode.

On every machine that runs a NameNode daemon, a ZKFC also runs to carry out the following tasks:

  • NameNode health monitoring
  • ZooKeeper session management
  • ZooKeeper-based NameNode election

If the NameNode on the machine where a ZKFC runs is healthy, and the znode lock used for the election is not held by another node, the ZKFC tries to acquire the lock. Successfully acquiring this exclusive lock means winning the election; the winner is then responsible for the failover: if necessary it fences the previous NameNode to make it unavailable, and then switches its own NameNode to the Active state.

Deployment and configuration

Hardware resources

To run an HA cluster, the following resources are needed:
1) NameNode machines: the machines that run the Active and Standby NameNodes should have the same hardware configuration, equivalent to what would be used for a non-HA NameNode.
2) JournalNode machines: the JournalNode daemons are fairly lightweight, so they can be co-located with the NameNodes or the YARN ResourceManager. At least 3 JournalNodes are required in order to tolerate the failure of one node. An odd number is usually deployed: with N JournalNodes in total, the cluster keeps working as long as no more than (N-1)/2 of them fail (for example, 5 JournalNodes tolerate 2 failures).

Note that the Standby NameNode also performs the checkpointing that used to be done by the Secondary NameNode, so a separate Secondary NameNode does not need to be deployed.

HA configuration

Nameservice: the logical name of the HDFS service

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>

 

NameNode configuration:
dfs.ha.namenodes.[nameservice]: the NameNode IDs belonging to the nameservice:

<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value> <!-- currently at most 2 NameNodes are supported -->
</property>

NameNode RPC addresses:

<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

 

NameNode HTTP server configuration:

<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property> <!-- if Hadoop security is enabled, use https-address instead -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>

 

Shared edit log location, i.e. the addresses of the JournalNode cluster, separated by semicolons:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
 <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>

 

 

Failover proxy provider class used by HDFS clients; currently only one implementation is provided:

<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
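With the nameservice and the failover proxy provider in place, clients address the cluster by the logical name instead of a specific NameNode host; typically fs.defaultFS in core-site.xml points at the nameservice. A sketch using the mycluster name from above:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

A client command such as hdfs dfs -ls hdfs://mycluster/ is then routed to whichever NameNode is currently active.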

 

Local path where each JournalNode stores the edit log:

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>

Fencing method configuration:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>

 

Although split-brain writes cannot occur when QJM is used as the shared storage, the old NameNode can still serve read requests, which may return stale data until the old NameNode shuts itself down after trying (and failing) to write to the JournalNodes. It is therefore still recommended to configure an appropriate fencing method.

Deployment and startup

After the configuration is complete, start the JournalNode (QJM) cluster by running the following on each JournalNode host:

hadoop-daemon.sh start journalnode

 

Configure and start the ZooKeeper cluster in the usual way; this mainly involves the data directory, node ids, timing settings and so on, all configured in zoo.cfg. The detailed steps are not listed here. Before it can be used, the HA state in ZooKeeper must be formatted:

hdfs zkfc -formatZK
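Formatting the ZooKeeper state (and running ZKFC at all) assumes automatic failover has been enabled and the ZKFC knows where the ZooKeeper quorum is. A sketch of the two properties usually involved, with placeholder ZooKeeper hostnames (dfs.ha.automatic-failover.enabled goes in hdfs-site.xml, ha.zookeeper.quorum in core-site.xml):

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>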

 

Format the NameNode (on the first NameNode only):

hdfs namenode -format

 

Start the two NameNodes:

# on the master (first) NameNode
hadoop-daemon.sh start namenode
# on the standby NameNode: copy the formatted metadata over, then start it
hdfs namenode -bootstrapStandby
hadoop-daemon.sh start namenode
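Because automatic failover is enabled, a ZKFC also has to run next to each NameNode. A sketch of the remaining step, using the same hadoop-daemon.sh style as above (in Hadoop 3.x the equivalent form is hdfs --daemon start zkfc):

# on each NameNode host, start the ZKFailoverController
hadoop-daemon.sh start zkfc

# confirm the daemons are running (NameNode, DFSZKFailoverController,
# and JournalNode on the hosts that run one)
jps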

The other components are started in the usual way.
