I love beijing
i love cina
beijing is the captial of china
统计的时候,如下图:
注意上图中,最左边的偏移量第一个为1,然后I LOVE CHINA中,I是第4个单词了;所以偏移量为4;
然后进行分词,
然后K3就是把每个单词归类的KEY,然后V3(1,1),就是说I这个单词,统计的频率;
相关的mapper的编写:
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> { @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { /* * key: 输入的key * value: 数据 I love Beijing * context: Map上下文 */ String data= value.toString(); //分词 String[] words = data.split(" "); //输出每个单词 for(String w:words){ context.write(new Text(w), new LongWritable(1)); } } }
reduce:
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable>{ @Override protected void reduce(Text k3, Iterable<LongWritable> v3,Context context) throws IOException, InterruptedException { //v3: 是一个集合,每个元素就是v2 long total = 0; for(LongWritable l:v3){ total = total + l.get(); } //输出 context.write(k3, new LongWritable(total)); } }
主程序:
public class WordCountMain { public static void main(String[] args) throws Exception { //创建一个job = map + reduce Configuration conf = new Configuration(); //创建一个Job Job job = Job.getInstance(conf); //指定任务的入口 job.setJarByClass(WordCountMain.class); //指定job的mapper job.setMapperClass(WordCountMapper.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(LongWritable.class); //指定job的reducer job.setReducerClass(WordCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class); //指定任务的输入和输出 FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); //提交任务 job.waitForCompletion(true); }