I won't go over the principles of MapReduce here; they were covered in the previous post.
This post walks through writing a WordCount program in Java on the MapReduce model, used to count how many times each word occurs.
The required jar packages are the same as in the previous post.
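To make the data flow concrete, here is a hypothetical example (the words and counts are illustrative, not taken from the actual text.txt): given an input line hello world hello, the mapper emits (hello,1), (world,1), (hello,1); the framework then groups the values by key, and the reducer sums each group, producing hello 2 and world 1.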
Code
TokenizerMapper.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of text and convert it to a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        for (String word : words) {
            // Emit (word, 1) for every word in the line
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
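As an aside, the WordCount example that ships with Hadoop splits lines with StringTokenizer and reuses a single Text and IntWritable instance across calls, instead of allocating new objects for every word, which reduces garbage-collection pressure. A minimal sketch of that variant (it needs an extra import of java.util.StringTokenizer):

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reused across map() calls to avoid allocating a new object per word
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}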
IntSumReducer.java

package com.cwh.mapreduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all the counts emitted for this word
        Iterator<IntWritable> it = values.iterator();
        int count = 0;
        while (it.hasNext()) {
            count += it.next().get();
        }
        // Emit (word, total count)
        context.write(key, new IntWritable(count));
    }
}
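Because IntSumReducer just sums values, its logic is associative and commutative, so it can double as a map-side combiner that pre-aggregates counts before the shuffle and cuts network traffic. An optional one-line addition to the driver below:

// Optional: run the reducer as a combiner to pre-aggregate counts on the map side
wordCountJob.setCombinerClass(IntSumReducer.class);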
WordCount.java
package com.cwh.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wordCountJob = Job.getInstance(conf);
        // Important: specify the jar this job lives in
        wordCountJob.setJarByClass(WordCount.class);
        // Set the mapper class for this job
        wordCountJob.setMapperClass(TokenizerMapper.class);
        // Set the reducer class for this job
        wordCountJob.setReducerClass(IntSumReducer.class);
        // Set the key/value types of the map output
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);
        // Set the key/value types of the final output
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);
        // Set the paths of the input text and the output directory
        FileInputFormat.setInputPaths(wordCountJob, "hdfs://192.168.27.131:9000/hdfsTest/");
        FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://192.168.27.131:9000/hdfsTest/output/"));
        // Submit the job to the Hadoop cluster and wait for it to finish
        boolean flag = wordCountJob.waitForCompletion(true);
        if (flag) {
            System.out.println("Job succeeded!");
        } else {
            System.out.println("Job failed!");
        }
        System.exit(flag ? 0 : 1);
    }
}
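One caveat: FileOutputFormat refuses to start a job whose output directory already exists, so a second run fails with an "output directory already exists" error. A small optional guard you can place before waitForCompletion (it needs an extra import of org.apache.hadoop.fs.FileSystem):

// Optional: remove a leftover output directory so the job can be re-run
Path outputPath = new Path("hdfs://192.168.27.131:9000/hdfsTest/output/");
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}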
Run and test
I developed and ran this from Eclipse on Windows, so the following error shows up:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
We just need to download
https://github.com/srccodes/hadoop-common-2.2.0-bin, extract it, configure the environment variables, and restart the machine:
Add a HADOOP_HOME variable pointing to the extracted directory
Append %HADOOP_HOME%\bin to Path
Add %HADOOP_HOME%\bin\winutils.exe; to classpath
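If you'd rather not change system environment variables (and reboot), a workaround I've seen used is to set the hadoop.home.dir system property in code before the Configuration is created; the path below is hypothetical and should point at your extracted hadoop-common-2.2.0-bin directory:

// Hypothetical path to the extracted hadoop-common-2.2.0-bin directory
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin");
Configuration conf = new Configuration();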
The next run then fails with a permission error; I simply turned off HDFS permission checking.
Edit hdfs-site.xml, add the following, and restart Hadoop afterwards:
<configuration>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
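Note that newer Hadoop releases (2.x and later) renamed this property, so if dfs.permissions has no effect on your version, use dfs.permissions.enabled instead; either way, disabling permission checks is a development-only shortcut and should not be done on a shared cluster:

<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>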
The previous post uploaded a file named text.txt to the hdfsTest directory, so we can use it directly here. The content of text.txt is as follows:
After the job runs, the Hadoop web client looks like this:
Two files are generated: a _SUCCESS marker and part-r-00000, which is our result file; download and open it to view the counts:
OK! With that, we have implemented a simple WordCount.