Three, zookeeper: implementing HA for the NN and the RM

One, hdfs namenode HA

1, Overview

In hadoop 1.0 the namenode is a single point of failure for an hdfs cluster: when the namenode becomes unavailable, the whole hdfs service becomes unavailable. Likewise, if the namenode has to be stopped temporarily for maintenance or other operations, the hdfs cluster cannot be used during that time.
HA solves this single point of failure to a large extent.

2, Key points of namenode HA

1) Metadata management has to change:
each namenode keeps a copy of the metadata in memory;
only the namenode in the Active state may write to the edits log;
both namenodes can read the edits log;
the edits log is kept in shared storage (qjournal and NFS are the two mainstream implementations);

2) A state management module is needed.
hadoop implements this as zkfailover (ZKFC), a process that resides on each namenode node. Each zkfailover monitors the state of its own namenode and records that state in zk. When a switch is required, zkfailover performs the switch, and split brain has to be prevented during the switch.

3) Passwordless ssh must work between the two namenodes. This is needed for fencing: the new active node sshes to the host of the failed namenode and kills its namenode process, preventing split brain.

4) Fencing: at any moment only one namenode provides service.

3, namenode HA automatic failover mechanism

Besides the two namenodes, automatic failover requires two additional components: a zookeeper cluster and the ZKFailoverController (ZKFC).

(1) ZKFC

ZKFC is a zookeeper client that is also responsible for monitoring the namenode's state. One ZKFC process runs on each namenode host.
1) Health monitoring:
ZKFC periodically pings the namenode on the same host with a health check command. As long as the namenode replies in time with a healthy status, ZKFC considers the node healthy. If the node crashes, freezes or otherwise becomes unhealthy, the health monitor marks it as unhealthy.
2) ZooKeeper session management:
while the local namenode is healthy, ZKFC keeps a session open in ZooKeeper. If the local namenode is active, ZKFC also holds a special lock znode, which relies on ZooKeeper's support for ephemeral (short-lived) nodes. If the session ends, the lock node is deleted automatically.

ZKFC creates a node /hadoop-ha/<HA cluster name>/ in zookeeper.
That node has two children:
ActiveBreadCrumb:
a persistent node whose value records the HA cluster name, the alias of the active node and the address of the active node.
It is mainly used by clients of the namenode service to look up the address of the active namenode, which is why it has to be a persistent node.

ActiveStandbyElectorLock:
an ephemeral node whose value also records the HA cluster name, the alias of the active node and the address of the active node.
It acts as a mutual-exclusion lock: only the holder of this node is allowed to update the value of the ActiveBreadCrumb node above.
Because it is ephemeral, the node exists as long as the active namenode keeps its connection to zk. The standby namenode also keeps a connection to zk, but when it sees that the ephemeral node already exists it knows someone else holds the lock, so it does nothing. When the active namenode runs into trouble, its ZKFC disconnects from zk and the ephemeral node disappears. The standby namenode then re-creates the ephemeral node, which amounts to acquiring the lock, and may update ActiveBreadCrumb. It thereby naturally becomes the new active namenode.
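
You can look at these znodes yourself on a running cluster with the zkCli.sh shell that ships with zookeeper. A minimal sketch, assuming the nameservice mycluster configured later in this article:

zkCli.sh -server bigdata121:2181               # zkCli.sh lives under the zookeeper install's bin/ directory
ls /hadoop-ha/mycluster                        # should list ActiveBreadCrumb and ActiveStandbyElectorLock
get /hadoop-ha/mycluster/ActiveBreadCrumb      # value identifies the currently active namenode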

3) ZooKeeper-based election:
if the local namenode is healthy and ZKFC sees that no other node currently holds the lock znode, it tries to acquire the lock itself. If it succeeds it has won the election, and it is then responsible for running the failover that makes its local namenode active.
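
The election boils down to "whoever creates the ephemeral lock node wins". A hand-run illustration with two zkCli.sh sessions (the /demo-lock path is made up for this demo; it is not the path ZKFC uses):

# session 1, simulating nn1's ZKFC winning the lock
create -e /demo-lock nn1        # succeeds; this session now holds the lock
# session 2, simulating nn2's ZKFC
create -e /demo-lock nn2        # fails with "Node already exists" while session 1 is alive
# when session 1 quits (or its host dies), zookeeper removes the ephemeral node
# automatically, and re-running the create in session 2 succeeds -- nn2 "becomes active"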

4, HA configuration

(1) Environment planning

Host                          Roles
bigdata121/192.168.50.121     namenode, journalnode, datanode, zk
bigdata122/192.168.50.122     namenode, journalnode, zk
bigdata123/192.168.50.123     zk
Software versions: hadoop 2.8.4, zookeeper 3.4.10, centos 7.2

The deployment of jdk and zookeeper is not repeated here; see the earlier articles.

Basic environment configuration (a shell sketch follows this list):
add hostname resolution for every machine to /etc/hosts;
configure passwordless ssh key login from every host to itself and to the other two hosts;
turn off the firewall and selinux.
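
A minimal sketch of these three steps on centos 7, using the hostnames and IPs planned above (run on every node):

cat >> /etc/hosts <<EOF
192.168.50.121 bigdata121
192.168.50.122 bigdata122
192.168.50.123 bigdata123
EOF

ssh-keygen -t rsa                                          # accept the defaults
ssh-copy-id bigdata121 && ssh-copy-id bigdata122 && ssh-copy-id bigdata123

systemctl stop firewalld && systemctl disable firewalld
setenforce 0                                               # and set SELINUX=disabled in /etc/selinux/config to make it permanent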

(2) Deployment

You can find a full hadoop deployment walkthrough in the previous article; here we focus on the namenode HA configuration.
Modify the configuration files:
core-site.xml

<configuration>
        <!-- name of the namenode HA cluster (nameservice) -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://mycluster</value>
        </property>

        <!-- directory where hdfs keeps its data blocks and metadata -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/modules/HA/hadoop-2.8.4/data/ha_data</value>
        </property>

        <!-- all nodes of the zk cluster, as ip:port -->
        <property>
                <name>ha.zookeeper.quorum</name>
            <value>bigdata121:2181,bigdata122:2181,bigdata123:2181</value>
        </property>
</configuration>

hdfs-site.xml

<configuration>
        <!-- nameservice id of the namenode HA cluster; must match the cluster name defined in core-site.xml -->
        <property>
                <name>dfs.nameservices</name>
                <value>mycluster</value>
        </property>

        <!-- the NameNode nodes in the cluster, listed here by their aliases -->
        <property>
                <name>dfs.ha.namenodes.mycluster</name>
                <value>nn1,nn2</value>
        </property>

        <!-- RPC address of nn1 -->
        <property>
                <name>dfs.namenode.rpc-address.mycluster.nn1</name>
                <value>bigdata121:9000</value>
        </property>

        <!-- RPC address of nn2 -->
        <property>
                <name>dfs.namenode.rpc-address.mycluster.nn2</name>
                <value>bigdata122:9000</value>
        </property>

        <!-- http address of nn1 -->
        <property>
                <name>dfs.namenode.http-address.mycluster.nn1</name>
                <value>bigdata121:50070</value>
        </property>

        <!-- http address of nn2 -->
        <property>
                <name>dfs.namenode.http-address.mycluster.nn2</name>
                <value>bigdata122:50070</value>
        </property>

        <!-- shared location of the NameNode edits log on the JournalNodes; multiple journalnodes are separated by semicolons -->
        <property>
                <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://bigdata121:8485;bigdata122:8485/mycluster</value>
        </property>

        <!-- fencing method: only one namenode may serve clients at a time. Two methods are available, shell and sshfence;
             sshfence logs in to the host of the failed namenode and kills its process, preventing split brain -->
        <property>
                <name>dfs.ha.fencing.methods</name>
                <value>sshfence</value>
        </property>

        <!-- sshfence needs passwordless ssh to the other host to kill the namenode process; this is the path of the private key -->
        <property>
                <name>dfs.ha.fencing.ssh.private-key-files</name>
                <value>/root/.ssh/id_rsa</value>
        </property>

        <!-- storage directory on the journalnode servers -->
        <property>
                <name>dfs.journalnode.edits.dir</name>
                <value>/opt/modules/HA/hadoop-2.8.4/data/jn</value>
        </property>

        <!-- disable permission checking -->
        <property>
                <name>dfs.permissions.enabled</name>
                <value>false</value>
        </property>

        <!-- client failover proxy provider: how clients locate the currently active namenode of the HA nameservice mycluster -->
        <property>
                <name>dfs.client.failover.proxy.provider.mycluster</name>
                <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        </property>

        <!-- enable automatic failover, so a failed namenode does not have to be switched manually -->
        <property>
                <name>dfs.ha.automatic-failover.enabled</name>
                <value>true</value>
        </property>
</configuration>

Synchronize the configuration files to every node, with scp or rsync, whichever you prefer.
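
For example, with rsync (a sketch; adjust the paths to your own layout):

rsync -av /opt/modules/HA/hadoop-2.8.4/etc/hadoop/ bigdata122:/opt/modules/HA/hadoop-2.8.4/etc/hadoop/
rsync -av /opt/modules/HA/hadoop-2.8.4/etc/hadoop/ bigdata123:/opt/modules/HA/hadoop-2.8.4/etc/hadoop/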

(3) Start the cluster

The first time you start:

cd /opt/modules/HA/hadoop-2.8.4

1) Start the journalnode service on every journalnode node
sbin/hadoop-daemon.sh start journalnode

2) Format the namenode on nn1 and start it
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode

3) On nn2, sync the namenode metadata from nn1 to the local namenode through the running journalnodes
bin/hdfs namenode -bootstrapStandby

4) Start nn2
sbin/hadoop-daemon.sh start namenode

5) On nn1, start all datanodes
sbin/hadoop-daemons.sh start datanode

6) Check the namenode state on both namenodes
bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2
Normally one is active and the other is standby.

7) Switch a namenode to active or standby manually
bin/hdfs haadmin -transitionToActive <namenode alias, e.g. nn1>
bin/hdfs haadmin -transitionToStandby <namenode alias, e.g. nn2>
Note: to switch manually you have to turn off automatic failover in hdfs-site.xml, otherwise the command reports an error.
Alternatively, use --forceactive to force the transition.
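
One step worth calling out explicitly: because dfs.ha.automatic-failover.enabled is true, the ZKFC state in zookeeper normally has to be initialized once and a ZKFC daemon started on each namenode host (start-dfs.sh also starts the ZKFCs on later boots). A sketch:

bin/hdfs zkfc -formatZK                 # run once, on one namenode; creates /hadoop-ha/mycluster in zk
sbin/hadoop-daemon.sh start zkfc        # run on each namenode host; starts the DFSZKFailoverController process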

After startup you can kill the active namenode manually and watch the namenode that was standby become active automatically. When the namenode you turned off comes back online, it becomes the standby.
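
A quick way to exercise this (a sketch; the kill step assumes nn1 on bigdata121 is currently the active namenode):

bin/hdfs dfs -mkdir /ha_test                  # write through the logical nameservice hdfs://mycluster
jps                                           # on bigdata121, note the NameNode pid
kill -9 <NameNode pid>                        # simulate a crash of the active namenode
bin/hdfs haadmin -getServiceState nn2         # should now report active
bin/hdfs dfs -ls /                            # the cluster keeps serving requests
sbin/hadoop-daemon.sh start namenode          # bring nn1 back; it rejoins as standby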

Subsequent starts:
just run sbin/start-dfs.sh directly

(4) Why is there no SNN?

After starting the complete namenode HA cluster, there is no SecondaryNameNode (SNN) process to be seen. Naively I assumed it still had to be started by hand, so I tried, and the start failed.
Looking at the SNN startup log you can find an exception like this:

org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Failed to start secondary namenode
java.io.IOException: Cannot use SecondaryNameNode in an HA cluster. The Standby Namenode will perform checkpointing.
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:189)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:690)

The message is clear: in an HA cluster the standby namenode takes over the SNN's duty of checkpointing, so an SNN is not allowed to exist. This is quite reasonable, since it puts the otherwise idle standby namenode to good use.
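
The standby's checkpointing is driven by the usual checkpoint settings; you can check what your cluster uses with hdfs getconf (the defaults noted in the comments are the stock hadoop 2.x values, to the best of my knowledge):

bin/hdfs getconf -confKey dfs.namenode.checkpoint.period    # seconds between checkpoints, 3600 by default
bin/hdfs getconf -confKey dfs.namenode.checkpoint.txns      # or after this many transactions, 1000000 by default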

Two, yarn resourceManager HA

1, Working mechanism

The mechanism is essentially the same as the namenode HA described above: a ZKFC-style elector (embedded in the resourcemanager) monitors the RM and uses zookeeper for leader election.
A node /yarn-leader-election/<yarn cluster id> is created in zk, and it has the same two children,
ActiveBreadCrumb and ActiveStandbyElectorLock,
which play the same roles as before, so they are not described again.
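
As with hdfs, you can inspect these znodes with zkCli.sh once RM HA is running. A sketch, assuming the cluster-yarn1 id configured below:

zkCli.sh -server bigdata121:2181
ls /yarn-leader-election/cluster-yarn1                      # should list ActiveBreadCrumb and ActiveStandbyElectorLock
get /yarn-leader-election/cluster-yarn1/ActiveBreadCrumb    # value identifies the currently active resourcemanager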

2, HA configuration

(1) Planning

Host          Roles
bigdata121    zk, rm
bigdata122    zk, rm
bigdata123    zk

(2) Configuration file

yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->
        <!-- reducers fetch data via the shuffle mechanism -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>

        <!-- enable log aggregation -->
        <property>
                <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>

        <!-- keep aggregated logs for 7 days (value is in seconds) -->
        <property>
                <name>yarn.log-aggregation.retain-seconds</name>
                <value>604800</value>
        </property>

    <!-- enable resourcemanager ha -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>

    <!-- cluster id shared by the two resourcemanagers -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>cluster-yarn1</value>
    </property>

    <!-- aliases of the two resourcemanager nodes -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>

    <!-- hostnames of the two rm nodes -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>bigdata121</value>
    </property>

    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>bigdata122</value>
    </property>

    <!-- addresses of the zookeeper cluster -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>bigdata121:2181,bigdata122:2181,bigdata123:2181</value>
    </property>

    <!-- enable state recovery and automatic failover -->
    <property>
         <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>

    <!-- store the resourcemanager state in the zookeeper cluster -->
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>

Synchronize the configuration file to the other nodes.

(3) Start the cluster

On bigdata121: start yarn
sbin/start-yarn.sh

On bigdata122: start the second rm
sbin/yarn-daemon.sh start resourcemanager

Check the service state:
bin/yarn rmadmin -getServiceState rm1
bin/yarn rmadmin -getServiceState rm2

Testing is similar to the namenode case and is not repeated here.


Origin: blog.51cto.com/kinglab/2447332