Writing WordCount Locally
Step 1: write the source code
Write your own WordCount code. The final source is shown below.
MyMap
package MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line of input
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Emit a (word, 1) pair to the ReduceTask for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
MyReduce
package MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts for this word
        int sum = 0;
        for (IntWritable i : values) {
            sum += i.get();
        }
        // 2. Emit the word and its total count
        context.write(key, new IntWritable(sum));
    }
}
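To make the flow between the two classes concrete, here is a hypothetical two-line input and what each phase produces (the framework groups the map output by key before handing it to MyReduce):

hello world
hello hadoop

MyMap emits (hello, 1), (world, 1), (hello, 1), (hadoop, 1); after the shuffle, MyReduce receives hadoop -> [1], hello -> [1, 1], world -> [1] and writes hadoop 1, hello 2, world 1.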
MyDriver
package MyWordCount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar load path
        job.setJarByClass(MyDriver.class);
        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        // 4. Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the reduce (final) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        // 8. Exit with the job's status (optional)
        System.exit(result ? 0 : 1);
    }
}
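To run the driver locally, pass two program arguments: the input directory, and an output directory that does not exist yet (FileOutputFormat refuses to overwrite an existing directory and fails with a FileAlreadyExistsException). The paths below are hypothetical placeholders for a Windows machine:

D:\wordcount\input D:\wordcount\output

For the two-line sample input shown earlier (hello world / hello hadoop), the job writes a part-r-00000 file into the output directory:

hadoop	1
hello	2
world	1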
Step 2: run it locally
This step hit all kinds of errors; below is a record of how I solved them one by one.
Add a log4j.properties file under the project's src directory
The warning looked like this:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
The contents of log4j.properties:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
Switch to an older JRE
Newer versions print warnings, but the program still runs.
I won't paste the warnings here.
Modify the NativeIO.java source
Create an org.apache.hadoop.io.nativeio package under the project's src directory and put the modified NativeIO.java in it; a class in the project's source tree takes precedence on the classpath over the same class inside the Hadoop jar.
JRE 1.8 is required here; newer versions report that the type sun.misc.Cleaner cannot be found (it was removed in JDK 9).
public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    // Skip the native access check and always report the path as accessible
    return true;
    // return access0(path, desiredAccess.accessRight());
    // This method sits inside the static Windows class
}
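For orientation, here is a minimal sketch of where that stub sits in the copied file, following the Hadoop 2.x source layout; it is not a complete file, and everything outside the stub should stay exactly as copied from your Hadoop version's NativeIO.java:

package org.apache.hadoop.io.nativeio;

import java.io.IOException;

public class NativeIO {
    public static class Windows {
        // ... the AccessRight enum and all other original members, unchanged ...

        // Stubbed so local runs on Windows skip the native access check
        public static boolean access(String path, AccessRight desiredAccess)
                throws IOException {
            return true;
        }
    }
    // ... remainder of the original file unchanged ...
}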
Imported the wrong package
It should be import org.apache.hadoop.io.Text;
and not import com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider.Text; (an easy slip, since the IDE's auto-import suggests both). The Jersey class is not a Hadoop Writable, so the asSubclass check during job setup throws a ClassCastException.
The error message:
java.lang.ClassCastException: class com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$Text
    at java.lang.Class.asSubclass(Unknown Source)