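Setting Up a Hadoop 2.7.1 High-Availability Environment

Hadoop basics: http://hadoop.apache.org/docs/r1.0.4/cn/quickstart.html
Note: figures missing from this article can be found in the attached DOC file.
Hadoop 2.7.1 cluster setup
1. System configuration
Computer 1 (Lenovo), Windows 7 64-bit, 8 GB RAM; a virtual machine on this computer runs the name system.
Computer 1 (Lenovo), Windows 7 64-bit, 8 GB RAM; a virtual machine on this computer runs the standby name (sname) system.
Computer 1 (Lenovo), Windows 7 64-bit, 8 GB RAM; a virtual machine on this computer runs the amrm system.
Virtualization: VMware 12.0
Hadoop 2.7.1
ZooKeeper 3.4.6
2. Cluster planning
The plan is as follows:
The JournalServer should ideally be a dedicated machine; the hosts in the slaves file act as JournalNodes (which store the NameNode metadata).
ZooKeeper, the NameNode, and the standby NameNode are configured on the JournalServer and JournalNode hosts.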
Hostname	IP	Installed software	Running processes
name	192.168.32.137	JDK, Hadoop, ZooKeeper	NameNode, DFSZKFailoverController, DataNode, JobHistoryServer, NodeManager, JournalNode, QuorumPeerMain
sname	192.168.32.135	JDK, Hadoop, ZooKeeper	NameNode, DFSZKFailoverController, DataNode, NodeManager, JournalNode, QuorumPeerMain
amrm	192.168.32.136	JDK, Hadoop, ZooKeeper	DataNode, NodeManager, JournalNode, QuorumPeerMain, ResourceManager

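Notes:
In Hadoop 2.0, HDFS HA normally consists of two NameNodes, one in the active state and one in the standby state. The active NameNode serves clients, while the standby NameNode serves no requests and only synchronizes the active NameNode's state so that it can take over quickly on failure. Hadoop 2.0 officially provides two HDFS HA solutions, one based on NFS and one based on QJM (Quorum Journal Manager). Here we use the simpler QJM. In this scheme the active and standby NameNodes synchronize metadata through a group of JournalNodes, and an edit counts as written once it has been persisted on a majority of the JournalNodes. For this reason an odd number of JournalNodes is usually configured.
A ZooKeeper cluster is also configured here for ZKFC (DFSZKFailoverController) failover: when the active NameNode dies, the standby NameNode is automatically switched to the active state.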
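
The majority rule is also why an odd JournalNode count is preferred. A minimal sketch of the arithmetic follows; the function names are made up for illustration and are not part of Hadoop.

```shell
# Number of JournalNodes that must acknowledge an edit (a strict majority).
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Number of JournalNode failures the group can survive.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5; do
  echo "journalnodes=$n quorum=$(quorum_size "$n") tolerated=$(tolerated_failures "$n")"
done
# journalnodes=3 quorum=2 tolerated=1
# journalnodes=4 quorum=3 tolerated=1
# journalnodes=5 quorum=3 tolerated=2
```

Four JournalNodes tolerate no more failures than three, so the extra node only raises the quorum size.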

1) On name, sname, and amrm, run vim /etc/hostname and set the hostname to name, sname, and amrm respectively, as shown in the figure below:
 


2) On name, sname, and amrm, run vim /etc/hosts and map the hostnames name, sname, and amrm to their IP addresses, as shown in the figure below:
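
The screenshot is not reproduced here. Going by the IP addresses in the cluster plan, the entries would presumably look like the following sketch (the helper name cluster_hosts is made up for illustration):

```shell
# Print the /etc/hosts entries for the three nodes of the cluster plan.
cluster_hosts() {
  cat <<'EOF'
192.168.32.137 name
192.168.32.135 sname
192.168.32.136 amrm
EOF
}

cluster_hosts
# To apply, run as root on every node:  cluster_hosts >> /etc/hosts
```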



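3) Verify that the systems can ping each other.
4) Install SSH and generate a key pair on name (you can copy ~/.ssh to sname and amrm so that all nodes share the same key pair):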
 ssh-keygen  -t  dsa  -P  ''  -f  ~/.ssh/id_dsa
 cat  ~/.ssh/id_dsa.pub  >>  ~/.ssh/authorized_keys

      Copy the public key to sname and amrm and repeat the same steps there (ideally all nodes share the same key pair):
scp  -r  /root/.ssh   root@sname:/root/
scp  -r  /root/.ssh/id_dsa.pub  root@sname:/root/.ssh/id_dsa.pub
scp  -r  /root/.ssh/id_dsa.pub  root@amrm:/root/.ssh/id_dsa.pub

Check with ssh sname and ssh amrm that the nodes can reach each other without a password. If the slaves file includes the local host, also run:
 ssh name

---------------------------------------------------------------------------------------------------------------------------------------------
scp commands:
# copy from a remote source to a local destination
scp root@data:/root/.ssh/id_dsa.pub  ~/.ssh/data_dsa.pub
# copy from a local source to a remote destination
scp  -r  /root/.ssh/id_dsa.pub  root@amrm:/root/.ssh/id_dsa.pub

---------------------------------------------------------------------------------------------------------------------------------------------
Note: scp only works once SSH connectivity is established.
Errors:
   -1. Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /root/.ssh/known_hosts:2
  remove with:
  ssh-keygen -f "/root/.ssh/known_hosts" -R sname

Run:
ssh-keygen -f "/root/.ssh/known_hosts" -R sname

Or delete line 2 of /root/.ssh/known_hosts.

-2. Warning: the ECDSA host key for 'sname' differs from the key for the IP address '192.168.32.138'
Offending key for IP in /root/.ssh/known_hosts:2
Matching host key in /root/.ssh/known_hosts:5
Are you sure you want to continue connecting (yes/no)? yes
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-16-generic x86_64)
* Documentation:  https://help.ubuntu.com/
82 packages can be updated.
42 updates are security updates.
Last login: Fri Dec 11 22:30:00 2015 from 192.168.32.138

Solution: delete line 2 of /root/.ssh/known_hosts.
-3. The private key id_dsa has mode 755, which ssh rejects as too open
chmod 700 ~/.ssh/id_dsa
(permissions of the private key file)
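
As a sketch of the permission fix, the helper below tightens a whole .ssh directory. The modes 700 for the directory and 600 for the key files are the conventional choice rather than something the original prescribes, and the function name is made up for illustration. The demonstration runs on a throwaway directory, not the real ~/.ssh.

```shell
# Tighten permissions on an SSH directory so that ssh accepts the key.
fix_ssh_perms() {  # $1: path to the .ssh directory
  chmod 700 "$1"
  chmod 600 "$1/id_dsa" "$1/authorized_keys"
}

# Demonstration on a temporary directory with an overly open key file.
demo=$(mktemp -d)
touch "$demo/id_dsa" "$demo/authorized_keys"
chmod 755 "$demo/id_dsa"
fix_ssh_perms "$demo"
ls -l "$demo"
rm -rf "$demo"
```

On a real node the call would be fix_ssh_perms ~/.ssh.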



5) Disable IPv6
-1. Check the current state:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6

       0 means IPv6 is enabled; 1 means it is disabled.

-2. Add the following lines to /etc/sysctl.conf, then reboot.
#disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

     -3. sudo vim /etc/default/grub
     -4. In that file, change
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

to

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 quiet splash"

     -5. Save with :wq, then run sudo update-grub.
     -6. Reboot so that the new kernel parameter takes effect. To verify, run

 ip a | grep inet6

       No output means IPv6 has been disabled.
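
The check from step -1 can be wrapped in a small helper, sketched below. The function name ipv6_state is made up for illustration; the optional path argument exists only so the helper can be tried on a test file.

```shell
# Report whether IPv6 is enabled, based on a disable_ipv6 proc entry.
ipv6_state() {  # $1: optional path, defaults to the "all" interface entry
  f=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
  case "$(cat "$f" 2>/dev/null)" in
    1) echo disabled ;;
    0) echo enabled ;;
    *) echo unknown ;;   # file missing or unexpected content
  esac
}

ipv6_state
```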

3. Install and configure the ZooKeeper cluster
1) Extract the ZooKeeper archive into /hadoop
 tar -zxvf zookeeper-3.4.6.tar.gz -C /hadoop

2) In /hadoop/zookeeper-3.4.6/conf, edit the ZooKeeper configuration file zoo.cfg; the configuration is shown in the figure below:
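
The zoo.cfg screenshot is not reproduced here, so the following is only a plausible reconstruction, not the author's file. Assumptions: tickTime, initLimit, and syncLimit are the values shipped in zoo_sample.cfg; 2888 and 3888 are the conventional peer and election ports; dataDir is the tmp directory created in the next step; clientPort matches the 2181 used in ha.zookeeper.quorum; and the server IDs follow the myid values given in this section. The helper name zoo_cfg is made up for illustration.

```shell
# Print a zoo.cfg for an ensemble given as id:host pairs.
zoo_cfg() {
  echo "tickTime=2000"
  echo "initLimit=10"
  echo "syncLimit=5"
  echo "dataDir=/hadoop/zookeeper-3.4.6/tmp"
  echo "clientPort=2181"
  for pair in "$@"; do
    # ${pair%%:*} is the server ID, ${pair#*:} is the hostname
    echo "server.${pair%%:*}=${pair#*:}:2888:3888"
  done
}

zoo_cfg 4:name 2:sname 3:amrm
```

Whatever IDs are used, each host's myid file must contain the N of its own server.N line.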




3) Create a tmp directory under /hadoop/zookeeper-3.4.6
mkdir /hadoop/zookeeper-3.4.6/tmp

4) In /hadoop/zookeeper-3.4.6/tmp, create the file myid and write 4 into it (the value must match this host's server.N entry in zoo.cfg)
vim  /hadoop/zookeeper-3.4.6/tmp/myid

5) Copy the configured ZooKeeper to sname and amrm
scp -r /hadoop/zookeeper-3.4.6  root@sname:/hadoop/zookeeper-3.4.6
scp -r /hadoop/zookeeper-3.4.6  root@amrm:/hadoop/zookeeper-3.4.6

6) On sname and amrm, change myid to 2 and 3 respectively.
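
A mismatch between myid and zoo.cfg is an easy mistake to make after copying the directory. The sketch below looks up the ID that a zoo.cfg assigns to a host; the function name and the sample server lines are illustrative assumptions, since the real zoo.cfg is not shown.

```shell
# Look up the server ID that zoo.cfg assigns to a hostname.
server_id_for() {  # $1: zoo.cfg path, $2: hostname
  sed -n "s/^server\.\([0-9][0-9]*\)=$2:.*/\1/p" "$1"
}

# Demonstration against a temporary file with assumed server lines.
cfg=$(mktemp)
printf 'server.4=name:2888:3888\nserver.2=sname:2888:3888\nserver.3=amrm:2888:3888\n' > "$cfg"
for host in name sname amrm; do
  echo "$host should have myid $(server_id_for "$cfg" "$host")"
done
rm -f "$cfg"
```

On a node, the result can be compared with the content of /hadoop/zookeeper-3.4.6/tmp/myid.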
4. Install and configure the Hadoop cluster
1) Extract the Hadoop archive into /hadoop
tar -zxvf hadoop-2.7.1.tar.gz -C /hadoop

2) Configure the Hadoop environment variables in ~/.bashrc, as shown below:



# environment variables for Hadoop
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export HADOOP_HOME=/hadoop/hadoop-2.7.1
export ZOOKEEPER_HOME=/hadoop/zookeeper-3.4.6
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${ZOOKEEPER_HOME}/bin:${PATH}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export YARN_HOME=${HADOOP_HOME}
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
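
After sourcing ~/.bashrc it is worth confirming that the three home variables point at real directories. A minimal sketch, with a made-up function name:

```shell
# Report whether each home variable is set and points at a directory.
report_homes() {
  for var in JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME; do
    eval "val=\${$var:-}"    # indirect lookup of the variable named in $var
    if [ -n "$val" ] && [ -d "$val" ]; then
      echo "$var ok"
    else
      echo "$var missing"
    fi
  done
}

report_homes
```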

5. Configure Hadoop
All Hadoop 2.7.1 configuration files live in /hadoop/hadoop-2.7.1/etc/hadoop.
cd /hadoop/hadoop-2.7.1/etc/hadoop

1) Edit hadoop-env.sh and set the JDK home directory
export  JAVA_HOME=/usr/lib/java/jdk1.7.0_79

2) Edit core-site.xml
<configuration>
	<!-- Set the HDFS nameservice to ns -->
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://ns</value>
	</property>
	<!-- Hadoop temporary directory -->
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/hadoop/tmp</value>
	</property>
	<!-- ZooKeeper quorum addresses -->
	<property>
		<name>ha.zookeeper.quorum</name>
		<value>name:2181,sname:2181,amrm:2181</value>
	</property>
</configuration>
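
On an installed cluster, hdfs getconf -confKey fs.defaultFS prints the effective value. Before Hadoop is running, a rough text-based lookup can serve as a sanity check. The sketch below assumes each name and value sits on a single line, as in the files of this article; it is not a real XML parser, and the function name get_prop is made up.

```shell
# Print the value of a property from a Hadoop *-site.xml file.
get_prop() {  # $1: config file, $2: property name
  awk -v key="$2" '
    /<name>/ { n = $0; sub(/.*<name>[ \t]*/, "", n); sub(/[ \t]*<\/name>.*/, "", n) }
    /<value>/ && n == key {
      v = $0; sub(/.*<value>[ \t]*/, "", v); sub(/[ \t]*<\/value>.*/, "", v)
      print v; exit
    }
  ' "$1"
}

# Demonstration on a temporary copy of two core-site.xml properties.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>name:2181,sname:2181,amrm:2181</value>
  </property>
</configuration>
EOF
get_prop "$conf" fs.defaultFS
get_prop "$conf" ha.zookeeper.quorum
rm -f "$conf"
```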

3) Edit hdfs-site.xml
<configuration>
	<!-- Set the HDFS nameservice to ns; must match core-site.xml -->
	<property>
		<name>dfs.nameservices</name>
		<value>ns</value>
	</property>
	<!-- ns contains two NameNodes, nm and snm -->
	<property>
		<name>dfs.ha.namenodes.ns</name>
		<value>nm,snm</value>
	</property>
	<!-- RPC address of nm -->
	<property>
		<name>dfs.namenode.rpc-address.ns.nm</name>
		<value>name:9000</value>
	</property>
	<!-- HTTP address of nm -->
	<property>
		<name>dfs.namenode.http-address.ns.nm</name>
		<value>name:50070</value>
	</property>
	<!-- RPC address of snm -->
	<property>
		<name>dfs.namenode.rpc-address.ns.snm</name>
		<value>sname:9000</value>
	</property>
	<!-- HTTP address of snm -->
	<property>
		<name>dfs.namenode.http-address.ns.snm</name>
		<value>sname:50070</value>
	</property>
	<!-- If hadoop.tmp.dir is set in core-site.xml these two need not be set; otherwise add them -->
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/hadoop/dfs/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/hadoop/dfs/data</value>
	</property>
	<!-- Where the NameNode edits are stored on the JournalNodes; including amrm makes the cluster more robust -->
	<property>
		<name>dfs.namenode.shared.edits.dir</name>
		<value>qjournal://name:8485;sname:8485;amrm:8485/ns</value>
	</property>
	<!-- Local disk location where each JournalNode stores its data -->
	<property>
		<name>dfs.journalnode.edits.dir</name>
		<value>/hadoop/journal</value>
	</property>
	<!-- Enable automatic failover of the NameNode -->
	<property>
		<name>dfs.ha.automatic-failover.enabled</name>
		<value>true</value>
	</property>
	<!-- Client-side failover proxy implementation -->
	<property>
		<name>dfs.client.failover.proxy.provider.ns</name>
		<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
	</property>
	<!-- Fencing methods; multiple methods are separated by newlines, one per line -->
	<property>
		<name>dfs.ha.fencing.methods</name>
		<value>
			sshfence
			shell(/bin/true)
		</value>
	</property>
	<!-- sshfence requires passwordless SSH -->
	<property>
		<name>dfs.ha.fencing.ssh.private-key-files</name>
		<value>/root/.ssh/id_dsa</value>
	</property>
	<!-- sshfence connection timeout in milliseconds -->
	<property>
		<name>dfs.ha.fencing.ssh.connect-timeout</name>
		<value>30000</value>
	</property>
</configuration>
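
Hadoop's HA property names are built from the nameservice and the NameNode IDs, for example dfs.namenode.rpc-address.<nameservice>.<id>, so a typo in either part silently breaks the configuration. The sketch below prints the per-nameservice keys that should appear in hdfs-site.xml; the helper name ha_keys is made up for illustration.

```shell
# Print the HA property names expected for a nameservice and its NameNode IDs.
ha_keys() {  # $1: nameservice, $2: comma-separated NameNode IDs
  echo "dfs.ha.namenodes.$1"
  for id in $(echo "$2" | tr ',' ' '); do
    echo "dfs.namenode.rpc-address.$1.$id"
    echo "dfs.namenode.http-address.$1.$id"
  done
  echo "dfs.client.failover.proxy.provider.$1"
}

ha_keys ns nm,snm
```

Each printed key can then be searched for with grep in hdfs-site.xml.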

4) Edit mapred-site.xml
<configuration>
	<!-- Run MapReduce on YARN -->
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
	<!-- JobHistory server addresses -->
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>name:10020</value>
	</property>

	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>name:19888</value>
	</property>
	<!-- This is a directory in HDFS: start HDFS first, then the history server -->
	<property>
		<name>mapreduce.jobhistory.intermediate-done-dir</name>
		<value>/history/indone</value>
	</property>
	<!-- This is a directory in HDFS: start HDFS first, then the history server -->
	<property>
		<name>mapreduce.jobhistory.done-dir</name>
		<value>/history/done</value>
	</property>
</configuration>

5) Edit yarn-site.xml
<configuration>
	<!-- ResourceManager host -->
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>amrm</value>
	</property>
	<!-- Address the ResourceManager exposes to clients.
	     Clients use it to submit and kill applications. -->
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>${yarn.resourcemanager.hostname}:8032</value>
	</property>
	<!-- Address the ResourceManager exposes to ApplicationMasters.
	     ApplicationMasters use it to request and release resources. -->
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>${yarn.resourcemanager.hostname}:8030</value>
	</property>
	<!-- Address the ResourceManager exposes to NodeManagers.
	     NodeManagers use it to send heartbeats and pick up tasks. -->
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>${yarn.resourcemanager.hostname}:8031</value>
	</property>
	<!-- Address the ResourceManager exposes to administrators for admin commands.
	     Default: ${yarn.resourcemanager.hostname}:8033 -->
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>${yarn.resourcemanager.hostname}:8033</value>
	</property>
	<!-- ResourceManager web UI address -->
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>${yarn.resourcemanager.hostname}:8088</value>
	</property>

	<!-- Auxiliary service loaded by the NodeManager: the MapReduce shuffle server -->
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
</configuration>
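
Because every address above is written as ${yarn.resourcemanager.hostname}:port, changing the hostname property moves all five endpoints at once. The sketch below prints the addresses that result for a given host, which is handy when checking firewall rules or /etc/hosts. The helper name rm_addresses is made up for illustration.

```shell
# Print the ResourceManager endpoints that result from a given hostname.
rm_addresses() {  # $1: ResourceManager hostname
  for entry in address:8032 scheduler.address:8030 \
               resource-tracker.address:8031 admin.address:8033 \
               webapp.address:8088; do
    # ${entry%%:*} is the property suffix, ${entry##*:} is the port
    echo "yarn.resourcemanager.${entry%%:*}=$1:${entry##*:}"
  done
}

rm_addresses amrm
```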

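6) Edit slaves
The slaves file lists the worker nodes. Because HDFS is started on name and YARN is started on amrm, the slaves file on name lists the DataNode hosts and the slaves file on amrm lists the NodeManager hosts.
cd /hadoop/hadoop-2.7.1/tmp/hadoop/etc/hadoop/
vim slaves
name
sname
amrm

Note: on name, the slaves file gives the DataNode and JournalNode hosts; on amrm, it gives the NodeManager hosts.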
6. Copy the configured Hadoop to sname and amrm
scp  -r /hadoop/hadoop-2.7.1/tmp  root@sname:/hadoop/hadoop-2.7.1/tmp/
scp  -r /hadoop/hadoop-2.7.1/tmp  root@amrm:/hadoop/hadoop-2.7.1/tmp/

*********************Note: the following steps must be performed strictly in order*****************************
7. Start the ZooKeeper cluster (on name, sname, and amrm, from /hadoop/hadoop-2.7.1/tmp/zk/bin)
cd /hadoop/hadoop-2.7.1/tmp/zk/bin    # start in order: name, sname, amrm
./zkServer.sh start     # start the ZooKeeper node
./zkServer.sh status    # check the ZooKeeper status
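
Once all three nodes are up, zkServer.sh status prints a line such as "Mode: follower" or "Mode: leader", and a healthy ensemble has exactly one leader. The sketch below extracts the mode and checks that condition; both function names are made up, and the sample status text is illustrative rather than captured from this cluster.

```shell
# Extract the mode from `zkServer.sh status` output read on stdin.
zk_mode() { sed -n 's/^Mode: //p'; }

# Succeed only if exactly one of the given modes is "leader".
ensemble_ok() {  # args: the mode reported by each node
  leaders=0
  for m in "$@"; do
    if [ "$m" = leader ]; then leaders=$((leaders + 1)); fi
  done
  [ "$leaders" -eq 1 ]
}

sample='ZooKeeper JMX enabled by default
Using config: /hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower'
echo "$sample" | zk_mode
if ensemble_ok follower leader follower; then echo "ensemble looks healthy"; fi
```

On a node the call would be ./zkServer.sh status 2>&1 | zk_mode.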

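8. Start the JournalNodes from /hadoop/hadoop-2.7.1/tmp/hadoop/sbin. Running the command on name is enough, because hadoop-daemons.sh (unlike hadoop-daemon.sh) starts the daemon on every host listed in slaves.
hadoop-daemons.sh start journalnode

jps (run on each node to check that a JournalNode process has appeared)
9. Format HDFS: run the format command on name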
hdfs  namenode  -format  ns

Formatting generates files under the directory configured by hadoop.tmp.dir in core-site.xml (here I configured /hadoop/hadoop-2.7.1/tmp). Copy the generated directory to the same location on sname and amrm:
scp  -r /hadoop/hadoop-2.7.1/dfs  root@sname:/hadoop/hadoop-2.7.1
scp  -r /hadoop/hadoop-2.7.1/dfs  root@amrm:/hadoop/hadoop-2.7.1

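Note: do not casually delete the directories generated by formatting; otherwise an inconsistency exception is reported at startup.
10. Format ZK: run the format command on name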
hdfs  zkfc  -formatZK

11. Start HDFS: on name, run start-dfs.sh from /hadoop/hadoop-2.7.1/tmp/hadoop/sbin
cd  /hadoop/hadoop-2.7.1/sbin/
start-dfs.sh

After startup, run jps on name, sname, and amrm and check that the NameNode and DFSZKFailoverController processes have appeared (on name and sname).
12. Start the history server: on name, from /hadoop/hadoop-2.7.1/tmp/hadoop/sbin, run
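
Reading jps output by eye on three nodes is error-prone. The sketch below prints the expected processes that are absent from a jps listing; the function name is made up and the sample listing is illustrative.

```shell
# Print every expected process name that does not appear in jps output.
missing_procs() {  # args: expected process names; stdin: output of `jps`
  seen=$(awk '{ print $2 }')
  for p in "$@"; do
    echo "$seen" | grep -qx "$p" || echo "$p"
  done
}

# Illustrative listing in which the failover controller has not started.
printf '2101 NameNode\n2230 DataNode\n2544 Jps\n' |
  missing_procs NameNode DFSZKFailoverController DataNode
```

On name or sname the call would be jps | missing_procs NameNode DFSZKFailoverController DataNode JournalNode QuorumPeerMain, following the process list in the cluster plan.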
hdfs dfs -mkdir /history
hdfs dfs -mkdir /history/indone
hdfs dfs -mkdir /history/done
mr-jobhistory-daemon.sh  start historyserver

13. Start YARN
On amrm, run start-yarn.sh from /hadoop/hadoop-2.7.1/tmp/hadoop/sbin
cd  /hadoop/hadoop-2.7.1/sbin/
start-yarn.sh

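start-yarn.sh is run on amrm. The NameNode and the ResourceManager are kept apart for performance reasons: both consume a lot of resources, so they are placed on different machines and must each be started on their own machine.
14. The Hadoop 2.7.1 configuration is now complete; you can check in a browser whether the deployment succeeded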
   1) http://192.168.32.137:50070 namenode
     


   2) http://192.168.32.136:8088  resourcemanager




    3) http://192.168.32.137:19888  JobHistoryServer




15. Run a job (from the Hadoop home directory, since the paths below are relative)

   1) hdfs  dfs  -mkdir /test
   2) hdfs  dfs  -mkdir /test/input
   3) hdfs  dfs  -put  etc/hadoop/*.xml  /test/input
   4) hadoop jar  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep /test/input /test/output 'dfs[a-z.]+'
      (screenshot: http://dl2.iteye.com/upload/attachment/0122/5366/817ca7cf-a5b0-307c-bdd8-40ede16677f4.png)
   5) hdfs  dfs  -get /test/output   output    # into the current directory
   6) cat   output/*    # view the results





Note: another way to view the results
  hdfs dfs -cat /test/output/*
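
To get a feel for what the grep example job computes, the same count can be produced locally with standard tools: every match of the regular expression is extracted, counted, and sorted by descending count. This is a local approximation for comparison, not the MapReduce implementation, and the function name is made up.

```shell
# Count the matches of an extended regex across files, most frequent first.
local_grep_count() {  # $1: extended regex, remaining args: files
  pattern=$1; shift
  cat "$@" | grep -oE "$pattern" | sort | uniq -c | sort -rn |
    awk '{ print $1 "\t" $2 }'
}

# Demonstration on a small temporary sample.
sample=$(mktemp)
printf '<name>dfs.replication</name>\n<name>dfs.nameservices</name>\ndfs.replication again\n' > "$sample"
local_grep_count 'dfs[a-z.]+' "$sample"
rm -f "$sample"
```

Running local_grep_count 'dfs[a-z.]+' etc/hadoop/*.xml from the Hadoop home directory should list the same matches as the job output in /test/output.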





Check the job status:









JobHistoryServer:







16. Shut down Hadoop
On amrm:

 stop-yarn.sh

On name:

 mr-jobhistory-daemon.sh stop historyserver
 stop-dfs.sh


17. Leave safe mode (if HDFS is stuck in safe mode)
hdfs dfsadmin -safemode leave

Note: if anything goes wrong during the steps above, check the relevant log files.
Related exceptions
1. org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
After a power loss, Hadoop HA reported org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. The web UI showed that both NameNodes had become standby, so one of them has to be switched to active manually:
              
 hdfs haadmin -transitionToActive --forcemanual nm
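
Before forcing a transition it helps to confirm the state of both NameNodes. hdfs haadmin -getServiceState <id> prints active or standby; the sketch below interprets a pair of such states. The function name and the messages are made up for illustration.

```shell
# Interpret the states of the two NameNodes.
ha_diagnosis() {  # $1, $2: states as printed by `hdfs haadmin -getServiceState`
  case "$1/$2" in
    active/standby|standby/active) echo "healthy: one active, one standby" ;;
    standby/standby) echo "no active NameNode: manual transition needed" ;;
    active/active)   echo "split brain: both NameNodes report active" ;;
    *)               echo "unknown state: $1/$2" ;;
  esac
}

ha_diagnosis standby standby
# On the cluster:
#   ha_diagnosis "$(hdfs haadmin -getServiceState nm)" "$(hdfs haadmin -getServiceState snm)"
```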

2. org.apache.hadoop.ipc.Client: Retrying connect to server: amrm/192.168.32.136:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
telnet: Unable to connect to remote host: Connection refused  Ubuntu 15.10

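Check whether the host can be pinged and whether the port is open. If ping works and the port is open, inspect the listening ports with:
netstat -ntulp
Make sure the Local Address is 0.0.0.0 or 192.168.32.137.
Solution: fix the address mapping in /etc/hosts.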
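
The relevant detail in the netstat output is the address part of the Local Address column for the port in question. A sketch that pulls it out follows; the function name is made up and the sample line is illustrative, not captured from this cluster.

```shell
# Print the local address a TCP port is bound to, from netstat output on stdin.
bound_address() {  # $1: port number
  awk -v port="$1" '
    $1 ~ /^tcp/ {
      n = split($4, parts, ":")
      if (parts[n] == port) {
        addr = $4
        sub(":" port "$", "", addr)   # strip the trailing :port
        print addr
      }
    }'
}

# A loopback binding like this one is unreachable from the other nodes.
printf 'tcp   0   0 127.0.1.1:8032   0.0.0.0:*   LISTEN   2210/java\n' | bound_address 8032
```

On the ResourceManager host the call would be netstat -ntulp | bound_address 8032. A loopback result usually means the hostname resolves to a loopback address in /etc/hosts.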





3. org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
The NameNode is in the standby state.



4. org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://sname:9000/user/root/grep-temp-1382738569
This happens because the intermediate files produced by the job are placed under /user/${username} in HDFS, so create that directory:
 hdfs  dfs  -mkdir /user
 hdfs  dfs  -mkdir /user/${username}


5. java.io.IOException: Unknown Job job_1450012188054_0001 at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:218)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getCounters(HistoryClientService.java:232) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getCounters(MRClientProtocolPBServiceImpl.java:159) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:281)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)




Solution:
hdfs dfs -chmod -R  777  /history

6



Solution: add an address mapping for jamel in /etc/hosts.
Notes:
1. Output of a successful job
15/12/13 22:17:44 INFO mapreduce.Job: Job job_1450012188054_0002 completed successfully
15/12/13 22:17:45 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=493
		FILE: Number of bytes written=1176179
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=33949
		HDFS: Number of bytes written=663
		HDFS: Number of read operations=30
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Killed map tasks=3
		Launched map tasks=12
		Launched reduce tasks=1
		Data-local map tasks=12
		Total time spent by all maps in occupied slots (ms)=1450715
		Total time spent by all reduces in occupied slots (ms)=112387
		Total time spent by all map tasks (ms)=1450715
		Total time spent by all reduce tasks (ms)=112387
		Total vcore-seconds taken by all map tasks=1450715
		Total vcore-seconds taken by all reduce tasks=112387
		Total megabyte-seconds taken by all map tasks=1485532160
		Total megabyte-seconds taken by all reduce tasks=115084288
	Map-Reduce Framework
		Map input records=926
		Map output records=17
		Map output bytes=508
		Map output materialized bytes=541
		Input split bytes=969
		Combine input records=17
		Combine output records=15
		Reduce input groups=15
		Reduce shuffle bytes=541
		Reduce input records=15
		Reduce output records=15
		Spilled Records=30
		Shuffled Maps =9
		Failed Shuffles=0
		Merged Map outputs=9
		GC time elapsed (ms)=67395
		CPU time spent (ms)=15090
		Physical memory (bytes) snapshot=1492398080
		Virtual memory (bytes) snapshot=6682742784
		Total committed heap usage (bytes)=1178963968
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=32980
	File Output Format Counters 
		Bytes Written=663
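
One quick consistency check on these counters: dividing the megabyte-seconds counter by the vcore-seconds counter gives the memory per vcore, which equals the container size if each container uses one vcore (an assumption, not something the log states). The function name is made up.

```shell
# Memory per vcore implied by the two resource counters.
mb_per_vcore() {  # $1: megabyte-seconds counter, $2: vcore-seconds counter
  echo $(( $1 / $2 ))
}

echo "map containers:    $(mb_per_vcore 1485532160 1450715) MB"
echo "reduce containers: $(mb_per_vcore 115084288 112387) MB"
# map containers:    1024 MB
# reduce containers: 1024 MB
```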


2. This article sets up high availability for the NameNode only; for ResourceManager HA, see:
ResourceManager HA:
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

Reposted from donald-draper.iteye.com/blog/2302217