Hadoop Primer (24): A TopK Program in MapReduce

1. Introduction

Finding the top K elements is one of the most common algorithmic tasks. This article uses MapReduce to compute the top K values in a massive dataset.

2. Example

(1) Example description
Given three files, each storing a number of integer values, find the top 5 among all of the values.

Sample input:
1) file1:

1
2
3
7
9
-99
2


2) file2:

11
2
23
17
9
199
22


3) file3:

21
12
3
17
2
39
12


Expected output:

199
39
23
22
21

(2) Problem analysis
To compute the top K of a massive dataset, we cannot load all of the data into memory. Instead, much like an external sort, the computation works in chunks: scan one chunk while maintaining a running top K, then move on to the next chunk, and finally merge the partial results. In MapReduce terms, each mapper keeps the top K of its own input split, and the reducer merges those partial lists, as sketched below.
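
Both the map and reduce sides boil down to the same "keep the K largest seen so far" routine. As a minimal standalone sketch in plain Java (no Hadoop; the class name TopKSketch and the hard-coded stream are illustrative only, not part of the job code below), the routine can also be written with a bounded min-heap, which replaces the O(K) Collections.min() scan used later with an O(log K) heap operation:

import java.util.PriorityQueue;

public class TopKSketch {
    public static void main(String[] args) {
        final int k = 5;
        int[] stream = {1, 2, 3, 7, 9, -99, 2, 11, 23, 17, 199, 22, 21, 12, 39};

        // Min-heap holding at most k elements: its root is the smallest
        // of the current top-k candidates, i.e. the next one to evict.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int v : stream) {
            if (heap.size() < k) {
                heap.offer(v);
            } else if (heap.peek() < v) {
                heap.poll();     // drop the current minimum
                heap.offer(v);   // admit the larger value
            }
        }

        // Drain the heap; values come out smallest-first.
        while (!heap.isEmpty())
            System.out.println(heap.poll());
    }
}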

(3) Implementation steps

1) Map phase
The input files are first handled by the default TextInputFormat, which turns each line into a <key, value> pair: the line's byte offset and its content. The map function therefore parses each value into an integer and maintains the top 5 for its own input split, emitting that partial result only in cleanup().
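
For file1, for example, the map function is invoked with pairs such as <0, "1">, <2, "2">, <4, "3">, and so on: the key is the line's byte offset within the file (assuming Unix line endings) and the value is the line's text.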


2) Reduce phase
After the map phase, the reduce phase receives every mapper's partial top 5 and merges them into the overall top 5, written one value per line, as traced below.
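
Tracing the code on the sample data: the three mappers emit the partial lists {9, 7, 3, 2, 2}, {199, 23, 22, 17, 11}, and {39, 21, 17, 12, 12}. The shuffle delivers these 15 candidates to the single reducer in ascending key order; applying the same keep-the-largest-5 rule leaves {21, 22, 23, 39, 199}, which cleanup() then sorts in descending order and writes out as the expected result.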

(4) Key code

package com.mk.mapreduce;


import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TopK {

    public static class TopKMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {

        // Partial top 5 for this mapper's input split; emitted in cleanup().
        private List<Integer> top5 = new ArrayList<>(5);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // Skip blank lines rather than failing on Integer.valueOf.
            if (StringUtils.isBlank(value.toString())) {
                System.out.println("blank line");
                return;
            }

            Integer v = Integer.valueOf(value.toString().trim());

            if (top5.size() < 5) {
                // Fill the list until it holds 5 candidates.
                top5.add(v);
            } else {
                // Replace the current minimum if the new value is larger.
                Integer min = Collections.min(top5);
                if (min < v) {
                    top5.remove(min);  // remove(Object): removes the value, not an index
                    top5.add(v);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit this split's partial top 5 once the whole split has been read.
            for (Integer v : top5)
                context.write(new IntWritable(v), NullWritable.get());
        }
    }


    public static class TopKReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {

        // Global top 5, accumulated over all candidate keys.
        private List<Integer> top5 = new ArrayList<>(5);
        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // Iterate over the grouped values so that a value emitted by
            // several mappers counts once per occurrence; ignoring the
            // iterable would collapse duplicate candidates into one.
            for (NullWritable ignored : values) {
                int v = key.get();
                if (top5.size() < 5) {
                    top5.add(v);
                } else {
                    Integer min = Collections.min(top5);
                    if (min < v) {
                        top5.remove(min);
                        top5.add(v);
                    }
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Sort descending; Comparator.reverseOrder() avoids the potential
            // integer overflow of the (a, b) -> b - a subtraction trick.
            top5.sort(Comparator.reverseOrder());
            for (Integer v : top5)
                context.write(new IntWritable(v), NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        String uri = "hdfs://192.168.150.128:9000";
        String input = "/topk/input";
        String output = "/topk/output";
        Configuration conf = new Configuration();
        if (System.getProperty("os.name").toLowerCase().contains("win"))
            conf.set("mapreduce.app-submission.cross-platform", "true");

        // Delete any previous output directory; the job fails if it already exists.
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(output);
        fileSystem.delete(path, true);

        Job job = Job.getInstance(conf, "TopK");
        job.setJar("./out/artifacts/hadoop_test_jar/hadoop-test.jar");
        job.setJarByClass(TopK.class);
        job.setMapperClass(TopKMapper.class);
        job.setReducerClass(TopKReducer.class);
        // A single reduce task is required so that one reducer sees every
        // mapper's candidates; 1 is Hadoop's default, made explicit here.
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPaths(job, uri + input);
        FileOutputFormat.setOutputPath(job, new Path(uri + output));


        boolean ret = job.waitForCompletion(true);
        System.out.println(job.getJobName() + "-----" + ret);
    }
}
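
After a successful run, the result appears in /topk/output/part-r-00000, one value per line in descending order (199, 39, 23, 22, 21 for the sample inputs). A nice property of this design is that each mapper emits at most five candidates, so the data shuffled to the single reducer stays tiny no matter how large the input is.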