Understand Hadoop's history and the scenarios it is mainly used for;
Set up a basic Hadoop environment and be able to use Hadoop at a simple level;
For Java web developers: how to apply Hadoop in real-world development;
About Hadoop:
Hadoop consists of several major parts: YARN handles resource and job management, HDFS handles distributed storage, and MapReduce handles distributed computation.
Preparation:
1. Create a hadoop user
Create the hadoop user:
sudo useradd -m hadoop -s /bin/bash
Set its password:
sudo passwd hadoop
Grant it administrator privileges:
sudo adduser hadoop sudo
2. Update apt
3. Install SSH and configure passwordless SSH login
ssh localhost
exit # exit the ssh localhost session just opened
cd ~/.ssh/ # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys # authorize the key
4. Install the JDK
sudo apt-get install openjdk-7-jre openjdk-7-jdk
dpkg -L openjdk-7-jdk | grep '/bin/javac'
This gives the JDK install path: /usr/lib/jvm/java-7-openjdk-amd64
vim ~/.bashrc
vi tip: o inserts a new line below the cursor, O inserts one above
Add on the first line: export JAVA_HOME=<JDK install path>
source ~/.bashrc # make the variable take effect
Install Hadoop:
Download:
sudo wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz
sudo tar -zxf ..../hadoop-2.6.1.tar.gz -C /usr/local # extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-2.6.1/ ./hadoop # rename the folder to hadoop
sudo chown -R hadoop ./hadoop # fix file ownership
Check that the installation succeeded:
cd /usr/local/hadoop
./bin/hadoop version
Hadoop pseudo-distributed configuration:
/usr/local/hadoop/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
</configuration>
After configuration, format the NameNode:
./bin/hdfs namenode -format
Add Hadoop to the system PATH for day-to-day convenience:
vi ~/.bashrc
# add the following line at the top
export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin
source ~/.bashrc
Running a Hadoop pseudo-distributed example - shell - count occurrences of words in files
# Start the NameNode and DataNode daemons:
/usr/local/hadoop>./sbin/start-dfs.sh
Run jps to check whether they started successfully
Visit http://localhost:50070 to view NameNode and DataNode information
# Create a working directory
/usr/local/hadoop>./bin/hdfs dfs -mkdir -p /user/hadoop
# Create a directory under the one created above
./bin/hdfs dfs -mkdir input
# Copy the xml files under ./etc/hadoop into the distributed file system as input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
# List the files in HDFS
./bin/hdfs dfs -ls input
# Count the occurrences of particular words in the input files
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
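The grep example job takes a regular expression and counts the occurrences of every matching string in the input. As a quick illustration of what 'dfs[a-z.]+' actually matches, here is a plain-Java sketch using only java.util.regex (no Hadoop required; the sample line is made up for demonstration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepPattern {
    public static List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        // Same pattern given to the example job: "dfs" followed by
        // one or more lowercase letters or dots.
        Matcher m = Pattern.compile("dfs[a-z.]+").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "hdfs://" does NOT match: the "dfs" there is followed by ':',
        // which is outside the [a-z.] class.
        String line = "<name>dfs.replication</name> hdfs://localhost:9000";
        System.out.println(matches(line)); // prints [dfs.replication]
    }
}
```

This mirrors the job's real output, where property names such as dfs.replication from the copied config files are the typical matches.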
# View the results
./bin/hdfs dfs -cat output/*
# Fetch the results back to the local machine
rm -r ./output # remove the local output folder first (if it exists)
./bin/hdfs dfs -get output ./output # copy the output folder from HDFS to the local machine
cat ./output/*
# Delete the output folder
./bin/hdfs dfs -rm -r output
# Stop Hadoop:
./sbin/stop-dfs.sh
Running a Java example - package a jar and debug on Linux - find the highest temperature of each year
# Create the input file in the root directory
vi input.txt
# Data format, for example:
2012010919 2012011023 2001010116 2001010212 2001010310
# Put it into the Hadoop root directory
./bin/hdfs dfs -put input.txt /
or
hadoop fs -put input.txt /
# Example program for reference:
https://git.oschina.net/gitosc_20160331/template-hadoop.git
Export the jar from Eclipse and upload it to the Linux (Hadoop) server, then run:
hadoop jar ./template-hadoop.jar
Representative output (trimmed; the full run prints one Before/After Mapper line per record, plus the usual LocalJobRunner/MapTask/ReduceTask INFO logs):
Before Mapper:0,2014010114======After Mapper:2014, 14
Before Mapper:12,2014010216======After Mapper:2014, 16
...
17/05/13 00:49:30 INFO mapreduce.Job: map 100% reduce 0%
Before Reduce:2001,12,10,11,29,16,After Reduce:2001,29
Before Reduce:2007,23,19,12,12,99,After Reduce:2007,99
Before Reduce:2008,16,14,37,16,5,After Reduce:2008,37
Before Reduce:2010,10,6,14,16,17,After Reduce:2010,17
Before Reduce:2012,19,12,32,9,23,After Reduce:2012,32
Before Reduce:2013,23,29,12,22,19,After Reduce:2013,29
Before Reduce:2014,14,6,10,17,16,After Reduce:2014,17
Before Reduce:2015,23,49,22,12,99,After Reduce:2015,99
After the run, delete output so the next debug run can reuse the path:
hdfs dfs -rm -r /output
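The Before/After Mapper and Reduce lines in the output come from the example's Mapper and Reducer. As a rough sketch of the core logic (plain Java, no Hadoop dependency; assumes the record layout above, where the first 4 characters are the year and the last 2 are the temperature):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxTemperature {
    // Parse a record such as "2014010114": chars 0-3 are the year,
    // chars 8-9 are the temperature reading.
    static int year(String rec) { return Integer.parseInt(rec.substring(0, 4)); }
    static int temp(String rec) { return Integer.parseInt(rec.substring(8, 10)); }

    // What the Reducer effectively does per key: keep the maximum
    // temperature seen for each year.
    static Map<Integer, Integer> maxPerYear(List<String> records) {
        Map<Integer, Integer> max = new TreeMap<>();
        for (String rec : records) {
            max.merge(year(rec), temp(rec), Math::max);
        }
        return max;
    }

    public static void main(String[] args) {
        // The sample records from input.txt above
        List<String> input = Arrays.asList(
            "2012010919", "2012011023", "2001010116", "2001010212", "2001010310");
        System.out.println(maxPerYear(input)); // prints {2001=16, 2012=23}
    }
}
```

In the real job the grouping per year is done by the shuffle phase; the Reducer only computes the max over each year's values.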
Running the Java example - local debugging on Windows - never succeeded, because of problem 5 below
To help locate problems while debugging, add a log4j.properties:
log4j.rootLogger=debug,stdout,R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=mapreduce_test.log
log4j.appender.R.MaxFileSize=1MB
log4j.appender.R.MaxBackupIndex=1
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n
log4j.logger.com.codefutures=DEBUG
# Download the Hadoop Eclipse plugin and put it under eclipse/plugins
Download: https://pan.baidu.com/share/link?uk=3976278079&shareid=111343850
# Download winutils.exe into hadoop_home/bin, and hadoop.dll into windows/system32
Download: https://github.com/amihalik/hadoop-common-2.6.0-bin
# I initially downloaded the wrong version, which caused an UnsatisfiedLinkError
http://pan.baidu.com/s/1qWG7XxU
# Configure HADOOP_HOME and add %HADOOP_HOME%/bin to the executable path
# Restart Eclipse
Configure Window - Preferences - Hadoop Map/Reduce with Hadoop's install path
Extract the hadoop-2.6.1.tar.gz downloaded earlier into some directory on Windows; that directory is Hadoop's install path on Windows
Configure the Hadoop server information: open Window - Show View and search for "map"
In the view that opens, create a new Hadoop location (New Hadoop Location)
Check whether the connection succeeds:
Problems encountered
1.Connection Refused
Running netstat -tpnl on the Hadoop server to check port usage showed that port 9000 was only open to localhost; change it to listen on all addresses.
Change in core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:9000</value>
</property>
localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:
https://wiki.apache.org/hadoop/ConnectionRefused
2.
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
At first I debugged locally through Windows Eclipse:
// input path
String dst = "hdfs://192.168.127.135:9000/input.txt";
// output path; it must not exist yet - even an empty folder will fail
String dstOut = "hdfs://192.168.127.135:9000/output";
Download winutils.exe and place it under hadoop-2.6.1/bin
3.java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray
hadoop.dll is missing from windows/system32, or its version is wrong
4.HADOOP_HOME or hadoop.home.dir are not set
HADOOP_HOME was set but the error persisted; in the end the workaround below fixed it
# set hadoop.home.dir programmatically
System.setProperty("hadoop.home.dir", "D:\\service\\hadoop\\hadoop-2.6.1");
Job job = new Job(hadoopConfig);
? - After setting an environment variable, Eclipse must be restarted before the new value is visible, because the JVM has already loaded the old environment
System.out.println(System.getenv("HADOOP_HOME")); // HADOOP_HOME was set but could not be read here (verified in cmd with set %HADOOP_HOME%)
System.out.println(System.getenv("path")); // this does have a value
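The JVM snapshots the process environment at startup, so System.getenv cannot see variables set after Eclipse launched; System.getProperty, by contrast, reads live JVM properties that can be changed at runtime, which is why the hadoop.home.dir workaround takes effect without a restart. A small demonstration (plain Java; the Windows path is the one used above):

```java
public class EnvVsProperty {
    public static void main(String[] args) {
        // Reads the environment snapshot taken when the JVM started;
        // HADOOP_HOME set in the OS afterwards is not visible here.
        System.out.println(System.getenv("HADOOP_HOME"));

        // System properties live inside the JVM and can be set at runtime,
        // so this is immediately readable without restarting anything.
        System.setProperty("hadoop.home.dir", "D:\\service\\hadoop\\hadoop-2.6.1");
        System.out.println(System.getProperty("hadoop.home.dir")); // prints D:\service\hadoop\hadoop-2.6.1
    }
}
```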
5. PrivilegedActionException as:Administrator (auth:SIMPLE) cause:org.apache.hadoop.util.Shell$ExitCodeException:
Tried several things; none of them worked:
5.1 # fix the permissions:
hadoop fs -chmod 777 /tmp
5.2 Change the owner
hadoop fs -chown -R hadoop:hadoop /tmp
5.3 Start YARN; this also requires configuring YARN on the Linux Hadoop side (mapred-site.xml):
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
When YARN is not needed, run: mv mapred-site.xml mapred-site.xml.template
hadoopConfig.set("hadoop.job.ugi", "hadoop");
System.setProperty("HADOOP_USER_NAME", "hadoop");
hadoopConfig.set("mapreduce.framework.name", "yarn");
hadoopConfig.set("yarn.resourcemanager.address", "192.168.127.135:8021");
5.4
Open "Local Users and Groups", expand "Users", find the administrator account "Administrator" and rename it to "hadoop"
Simple use of HBase
# Download HBase
sudo wget http://apache.fayea.com/hbase/stable/hbase-1.2.5-bin.tar.gz
# Extract
sudo tar -zxvf hbase-1.2.5-bin.tar.gz
# Edit conf/hbase-site.xml
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://127.0.0.1:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>127.0.0.1:2181</value>
</property>
</configuration>
hbase.rootdir is the HDFS address; its ip:port must match fs.defaultFS in hadoop/core-site.xml
hbase.zookeeper.quorum is the ZooKeeper address; multiple addresses can be configured
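For example, a quorum of three ZooKeeper nodes would be a comma-separated list, following the same host:port convention used above (the hostnames here are illustrative, not real):

```xml
<property>
<name>hbase.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```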
# Edit conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Use HBase's bundled ZooKeeper
export HBASE_MANAGES_ZK=true
# Start HBase
sudo ./bin/start-hbase.sh
This may fail with 127.0.0.1: Permission denied, please try again.
I tried the following approaches:
1. Under hbase_home/conf, run
sudo chmod +x start-hbase.sh
2. Comment out the hbase.zookeeper.quorum setting
3. sudo su
With that it starts, but still reports errors
# Stop HBase
sudo ./bin/stop-hbase.sh
Common commands:
start-dfs.sh
http://localhost:50070
stop-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver # start the history server; required to view job runs in the web UI
http://192.168.127.135:8088/cluster
stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver
References:
Hadoop installation tutorial: standalone/pseudo-distributed setup (Hadoop 2.6.0 / Ubuntu 14.04)
http://www.powerxing.com/install-hadoop/
Using the Hadoop2.6.0-eclipse-plugin on Windows
https://my.oschina.net/muou/blog/408543
Hadoop getting started: a simple MapReduce example
http://blog.csdn.net/zhangt85/article/details/42077281
Hadoop Eclipse Plug-in
https://wiki.apache.org/hadoop/EclipsePlugIn
Hadoop is powerful, but it is not a silver bullet
http://database.51cto.com/art/201402/429789.htm
Understanding the distributed architectures of Hadoop, HBase, Hive and Spark in one article
http://www.codeceo.com/article/understand-hadoop-hbase-hive-spark-distributed-system-architecture.html
HBase does not run after ./start-hbase.sh - Permission Denied?
http://stackoverflow.com/questions/21166542/hbase-does-not-run-after-start-hbase-sh-permission-denied
http://www.cnblogs.com/wukenaihe/archive/2013/03/15/2961029.html
Common exceptions in HBase applications
https://www.zybuluo.com/xtccc/note/91427