Packaging and uploading to the server for every debug run is inefficient, so you can simulate the job locally instead. Taking the code from section 9 as an example, point the input and output paths at local directories:
// Input path for the text data to process
FileInputFormat.setInputPaths(wordCountJob, "d:/wordcount/srcdata");
// Output path for the final result
FileOutputFormat.setOutputPath(wordCountJob, new Path("d:/wordcount/output"));
The complete code is as follows:
package com.wange;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJobSubmitter {
    public static void main(String[] args) throws Exception {
        //System.setProperty("hadoop.home.dir", "E:/soft/hadoop-2.4.1");
        Configuration config = new Configuration();
        // Whether the job runs locally comes down to the two parameters below:
        // if they are not set, the job runs in the local simulator;
        // if they are set, it is submitted to the YARN cluster.
        //config.set("mapreduce.framework.name", "yarn");
        //config.set("yarn.resourcemanager.hostname", "hadoop-server-00:9000"); // run on the remote YARN cluster

        Job wordCountJob = Job.getInstance(config);

        // Specify the jar that contains this job
        wordCountJob.setJarByClass(WordCountJobSubmitter.class);

        // Set the mapper and reducer classes
        wordCountJob.setMapperClass(WordCountMapper.class);
        wordCountJob.setReducerClass(WordCountReducer.class);

        // Set the key/value types output by the map and reduce phases
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);

        // Input path for the text data to process
        //FileInputFormat.setInputPaths(wordCountJob, "hdfs://hadoop-server-00:9000/wordcount/srcdata/");
        FileInputFormat.setInputPaths(wordCountJob, "d:/wordcount/srcdata");

        // Output path for the final result
        //FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://hadoop-server-00:9000/wordcount/output/"));
        FileOutputFormat.setOutputPath(wordCountJob, new Path("d:/wordcount/output"));

        // Submit to the Hadoop cluster; true = print progress information
        wordCountJob.waitForCompletion(true);
    }
}
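The WordCountMapper and WordCountReducer classes referenced above come from section 9 and are not repeated here. The core logic they implement can be sketched in plain Java with no Hadoop dependencies (all names here are illustrative, not the section 9 code): the map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the WordCount map/reduce logic.
public class WordCountSketch {
    // "Map" phase: emit one (word, 1) pair per whitespace-separated token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle + reduce" phase: group pairs by word and sum the 1s
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hello world", "hello hadoop"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job, Hadoop performs the grouping step between map and reduce across the cluster; this sketch only shows the data flow.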
Then run the main program. You will hit a few small pitfalls that keep it from running.
Pitfall 1: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the co
Solution: add the following dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.4.1</version>
</dependency>
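For reference, a local run of this example also needs the core client jars on the classpath. The list below is an assumption based on a typical Hadoop 2.4.1 Maven setup, not a copy of this project's pom:

```xml
<dependencies>
    <!-- Core Hadoop classes: Configuration, Path, Writable types -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.4.1</version>
    </dependency>
    <!-- MapReduce job API: Job, Mapper, Reducer, input/output formats -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.4.1</version>
    </dependency>
    <!-- LocalClientProtocolProvider, needed for the local-run mode (Pitfall 1) -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.4.1</version>
    </dependency>
</dependencies>
```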
Pitfall 2: Exception in thread "main" java.lang.NullPointerException
Solution: set System.setProperty("hadoop.home.dir", "E:/soft/hadoop-2.4.1"); and download the native library files needed to run on Windows: https://pan.baidu.com/s/17lkdxPTcKeWN-puLEqqXKw (extraction code: ds5k). Unzip the downloaded files into the bin directory of the local Hadoop installation, which here is E:\soft\hadoop-2.4.1\bin.
With that, the program runs perfectly. When simulating locally, you can also use HDFS paths, for example:
FileInputFormat.setInputPaths(wordCountJob, "hdfs://hadoop-server-00:9000/wordcount/srcdata/");
FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://hadoop-server-00:9000/wordcount/output/"));
Running this way can raise a permission error. Log in to the HDFS server and open up the directory permissions; the command is: hadoop fs -chmod 777 /wordcount