MapReduce Case Studies

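1. Inverted Index
An inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the locations where it occurs in a document or a set of documents, which provides a way to look up documents by their content. Because it maps content to documents rather than documents to their content, it is called an inverted index.
1.1. Example Description
An inverted index typically consists of a word (or phrase) and a list of related documents. Each entry in the list is either an ID that identifies the document or the URL where the document is located, as shown in the figure below: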
[Figure: an inverted index mapping each word to its document list]
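The figure shows that word 1 appears in {document 1, document 4, document 13, ...}, word 2 appears in {document 3, document 5, document 15, ...}, and word 3 appears in {document 1, document 8, document 20, ...}. In practice, each document also needs a weight that indicates how relevant it is to the search, as shown in the figure below: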
[Figure: an inverted index with a weight attached to each document]
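The most common weight is term frequency, that is, the number of times the word occurs in the document. Taking English text as an example, as shown in the figure below, the "MapReduce" line of the index file means that the word "MapReduce" occurs once in text T0, once in T1, and twice in T2. For the search terms "MapReduce", "is" and "Simple", the matching set is {T0, T1, T2} ∩ {T0, T1} ∩ {T0, T1} = {T0, T1}, so documents T0 and T1 contain all of the search terms, and only in T0 do the terms appear consecutively.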
[Figure: example index file with term frequencies per document]
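The lookup described above is a plain set intersection over the document lists. The following is a minimal Java sketch of that idea, not part of the MapReduce job itself. It reuses the "word 1", "word 2" and "word 3" lists from the earlier example, limited to the three documents listed for each word; the class name PostingIntersection and its methods are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PostingIntersection {

    /** Returns the documents that contain every one of the given terms. */
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = index.get(term);
            if (docs == null) {
                return new TreeSet<Integer>(); // an unknown term matches no document
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs); // copy so the index is not modified
            } else {
                result.retainAll(docs); // set intersection
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    /** Document lists from the example, truncated to the documents that are listed. */
    static Map<String, Set<Integer>> sampleIndex() {
        Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
        index.put("word1", new TreeSet<Integer>(Arrays.asList(1, 4, 13)));
        index.put("word2", new TreeSet<Integer>(Arrays.asList(3, 5, 15)));
        index.put("word3", new TreeSet<Integer>(Arrays.asList(1, 8, 20)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = sampleIndex();
        System.out.println(search(index, Arrays.asList("word1", "word3"))); // prints [1]
        System.out.println(search(index, Arrays.asList("word1", "word2"))); // prints []
    }
}
```

Checking whether the terms are consecutive would additionally require word positions in the index, which this sketch does not store.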
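1.2. Design Approach
1) Map phase
The input files are first processed by the default TextInputFormat class, which yields the byte offset and the content of each line of text. The Map phase must then parse each input <key,value> pair to obtain the three pieces of information the inverted index needs: the word, the document URL, and the term frequency, as shown in the figure below.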
[Figure: input and output of the Map phase]
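This raises two problems. First, a <key,value> pair can hold only two values, so unless a custom Hadoop data type is used, two of the three values must be merged into one and used as the key or the value.
Second, a single Reduce phase cannot both count term frequencies and build the document list, so a Combine phase must be added to do the term-frequency counting.
Here the word and the URL together form the key (for example "MapReduce:file1.txt") and the term frequency is the value. The advantage is that the map-side sort built into the MapReduce framework groups the counts for the same word in the same document into a list and passes it to the Combine phase, which then works much like WordCount.
2) Combine phase
After the map method has run, the Combine phase sums the values that share a key, which gives the frequency of a word in a document. If the output shown in the figure were fed directly to the Reduce phase, the Shuffle would face a problem: all records for the same word (each made up of word, URL and frequency) should be handled by the same Reducer, but the current key cannot guarantee that, so the key and value must be changed. This time the word alone is the key, and the URL and frequency form the value (for example "file1.txt:1"). The intent is that the framework's default HashPartitioner then sends all records for the same word to the same Reducer. Note, however, that Hadoop partitions records by the map output key before the combiner runs, and it does not guarantee how many times a combiner is invoked, so this design is only reliable with a single reduce task (the default) and with the combiner applied exactly once to each map output.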
[Figure: input and output of the Combine phase]
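To see why the choice of key matters for partitioning, the sketch below applies the formula used by Hadoop's HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to both key shapes. It is an illustration under stated assumptions: String.hashCode() stands in for the hash of Hadoop's Text key, which is computed differently, so the partition numbers it prints are not the ones a real job would produce. The class name and the reducer count of 4 are hypothetical.

```java
final class PartitionSketch {

    /** Same formula as Hadoop's HashPartitioner, applied to a String key (assumption: String hash). */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // hypothetical number of reduce tasks
        String[] files = {"file1.txt", "file2.txt", "file3.txt"};
        // Composite keys for one word hash independently, so they may land in different partitions
        for (String file : files) {
            String composite = "MapReduce:" + file;
            System.out.println(composite + " -> partition " + partition(composite, reducers));
        }
        // A word-only key always maps to one partition, whichever file it came from
        System.out.println("MapReduce -> partition " + partition("MapReduce", reducers));
    }
}
```

Equal keys always land in the same partition, while composite keys for the same word carry no such guarantee. With a single reduce task every key maps to partition 0, which is why the design works in that configuration.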
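3) Reduce phase
After the two phases above, the Reduce phase only needs to combine the values that share a key into the format required by the inverted index file; the MapReduce framework takes care of the rest.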
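The three phases can be traced without a cluster. The sketch below simulates them with ordinary Java collections: it counts occurrences per "word:file" key (Map plus Combine), then re-keys by word and concatenates the "file:count;" entries (Reduce). It is a simplified stand-in for the data flow, not the Hadoop job. The class name, method names and the three file contents are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {

    /** Map + Combine: count occurrences per composite "word:file" key. */
    static Map<String, Integer> mapAndCombine(Map<String, String> files) {
        Map<String, Integer> counts = new TreeMap<String, Integer>(); // sorted, like the framework's sort
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                String key = word + ":" + file.getKey();
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    /** Re-key by word, then Reduce: concatenate "file:count;" entries for each word. */
    static Map<String, String> reduce(Map<String, Integer> combined) {
        Map<String, String> index = new TreeMap<String, String>();
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            int split = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, split);
            String posting = e.getKey().substring(split + 1) + ":" + e.getValue() + ";";
            String old = index.get(word);
            index.put(word, old == null ? posting : old + posting);
        }
        return index;
    }

    /** Hypothetical input files, chosen only for illustration. */
    static Map<String, String> sampleFiles() {
        Map<String, String> files = new LinkedHashMap<String, String>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        return files;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : reduce(mapAndCombine(sampleFiles())).entrySet()) {
            // One of the lines printed: MapReduce file1.txt:1;file2.txt:1;file3.txt:2;
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```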

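1.3. Program Code
InvertedIndexMapper:

import java.io.IOException;

// Assumption: Apache Commons Lang 3; the original does not show which StringUtils it imports
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Holds the composite "word:fileName" key; reused across calls
    private final Text keyInfo = new Text();
    // Term frequency, always 1 for a single occurrence
    private static final Text valueInfo = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line into fields on spaces
        String[] fields = StringUtils.split(line, " ");
        // The input split identifies the file this line came from
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        for (String field : fields) {
            // The key is the word plus the file name, e.g. "MapReduce:file1.txt"
            keyInfo.set(field + ":" + fileName);
            context.write(keyInfo, valueInfo);
        }
    }
}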
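InvertedIndexCombiner:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

    private final Text info = new Text();

    // Input:  <"MapReduce:file3", {"1", "1", ...}>
    // Output: <"MapReduce", "file3:2">
    // Caveat: this rewrites the key, so it is only correct if the combiner runs
    // exactly once on each map output, which Hadoop does not guarantee.
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0; // term frequency of this word in this file
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        String composite = key.toString();
        // Split at the last ':' so a word that itself contains ':' stays whole
        // (assumes file names contain no ':')
        int splitIndex = composite.lastIndexOf(':');
        // The new value is the file name plus the term frequency
        info.set(composite.substring(splitIndex + 1) + ":" + sum);
        // The new key is the word alone
        key.set(composite.substring(0, splitIndex));
        context.write(key, info);
    }
}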
InvertedIndexReducer:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

    private final Text result = new Text();

    // Input:  <"MapReduce", {"file1:1", "file2:1", "file3:2"}>
    // Output: <"MapReduce", "file1:1;file2:1;file3:2;">
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Build the document list for this word
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}
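InvertedIndexRunner:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexRunner {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertedIndexRunner.class);

        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        // The combiner rewrites keys after partitioning, so keep a single reduce task
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // If the output path given on the command line already exists, delete it first
        Path output = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}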

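2. Data Deduplication
The data deduplication example is mainly meant to show how parallel processing can be used to filter data in a meaningful way. Seemingly complicated tasks, such as counting the number of distinct kinds of data in a large data set or working out where visitors come from in website logs, all involve deduplication.
2.1. Example Description
Remove duplicates from the data in the data files. Each line of a data file is one data item. For example, the original input data is: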
File1:
2017-3-1 a
2017-3-2 b
2017-3-3 c
2017-3-4 d
2017-3-5 a
2017-3-6 b
2017-3-7 c
2017-3-3 c
File2:
2017-3-1 b
2017-3-2 a
2017-3-3 b
2017-3-4 d
2017-3-5 a
2017-3-6 c
2017-3-7 d
2017-3-3 c
The output is:
2017-3-1 a
2017-3-1 b
2017-3-2 a
2017-3-2 b
2017-3-3 b
2017-3-3 c
2017-3-4 d
2017-3-5 a
2017-3-6 b
2017-3-6 c
2017-3-7 c
2017-3-7 d
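2.2. Design Approach
The goal of deduplication is that any data item occurring more than once in the input appears only once in the output file. The natural idea is to send all records of the same item to the same reduce machine, which outputs it once no matter how many times it occurs. Concretely, the reduce input should use the data item as the key, with no requirement on the value-list. When reduce receives a <key, value-list> it copies the key to the output key and sets the output value to empty.
In the MapReduce flow, the <key, value> pairs output by map are grouped by the shuffle into <key, value-list> and handed to reduce. Working backward from that reduce input, the map output key must be the data item and the value can be anything. In this example each data item is one line of an input file, so after Hadoop's default input format has read the line, the map phase simply uses the incoming value as the output key and emits it (the output value is arbitrary). The map results pass through the shuffle to reduce, which ignores how many values each key has: it copies the input key to the output key and emits it with an empty value.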
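The reasoning above can be checked with a small in-memory simulation. In the sketch below a TreeMap plays the role of the shuffle: it groups identical lines under one key and keeps the keys sorted, and the "reduce" step emits each key once. This is only an illustration of the data flow using the File1 and File2 sample lines. The class name DedupSketch and its methods are hypothetical, and the per-key counter merely stands in for the ignored value-list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DedupSketch {

    static List<String> dedup(List<String> lines) {
        // Shuffle stand-in: group equal lines under one key, keys kept in sorted order
        Map<String, Integer> groups = new TreeMap<String, Integer>();
        for (String line : lines) {
            // Map step: the line itself is the key; the count stands in for the value-list
            Integer seen = groups.get(line);
            groups.put(line, seen == null ? 1 : seen + 1);
        }
        // Reduce step: emit every key exactly once, ignoring how many values it had
        return new ArrayList<String>(groups.keySet());
    }

    /** The lines of File1 followed by the lines of File2 from the example. */
    static List<String> sampleInput() {
        return Arrays.asList(
                "2017-3-1 a", "2017-3-2 b", "2017-3-3 c", "2017-3-4 d",
                "2017-3-5 a", "2017-3-6 b", "2017-3-7 c", "2017-3-3 c",
                "2017-3-1 b", "2017-3-2 a", "2017-3-3 b", "2017-3-4 d",
                "2017-3-5 a", "2017-3-6 c", "2017-3-7 d", "2017-3-3 c");
    }

    public static void main(String[] args) {
        for (String line : dedup(sampleInput())) {
            System.out.println(line); // 12 distinct lines, from "2017-3-1 a" to "2017-3-7 d"
        }
    }
}
```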
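2.3. Program Code
Mapper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The whole line becomes the output key; the value carries no information
        context.write(value, NullWritable.get());
    }
}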
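Reducer:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupReducer extends
        Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        // Emit each distinct key once, regardless of how many values it has
        context.write(key, NullWritable.get());
    }
}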
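Runner:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupRunner {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DedupRunner.class);

        job.setMapperClass(DedupMapper.class);
        job.setReducerClass(DedupReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Report success or failure through the process exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}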
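3. Top N
Top-N analysis means selecting the N items of interest from the objects under study and focusing the analysis on those N items. How can MapReduce be used to find the top N numbers in a massive data set?
3.1. Example Description
Find the N largest values in the data files. Each number in a data file is one data item.
The original input data is: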
10 3 8 7 6 5 1 2 9 4
11 12 17 14 15 20
19 18 13 16
The output (the 5 largest values) is:
20
19
18
17
16
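3.2. Design Approach
The key to finding the top N is to realize that there must be exactly one reduce task.
Each map task is a separate process, so there are as many intermediate files as map tasks and as many final output files as reduce tasks. The top N we want is the global top N, so no matter how many map tasks there are, a single reduce task must aggregate the data and output the top N.
Mapper phase
Use the default number of mappers; each input split is processed by one mapper.
In each map task we find the top n records of its input split. A TreeMap holds the top-n data; by default a TreeMap is sorted in ascending natural order of its keys. In map we try to update the TreeMap with every record, removing the smallest key whenever the map grows beyond n entries, so at the end it holds the n values of the local top n for this split.
In earlier mappers we called context.write once for each record processed. Here the output is written only after the whole input split has been processed, so context.write goes in cleanup, a method that runs once after the mapper task has processed all of its records.
A TreeMap is a sorted key-value collection. By default it is ordered by the natural order of its keys, or by a Comparator supplied when the map is created. Its firstKey() method returns the first (lowest) key currently in the map.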
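The bounded-TreeMap technique described for the mapper can be tried on its own. The sketch below keeps at most n keys and evicts firstKey(), the smallest, whenever the map overflows. It is a plain Java illustration outside Hadoop. The class name LocalTopN is hypothetical, and the input is the first line of the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class LocalTopN {

    /** Returns the n largest distinct numbers, in ascending order. */
    static List<Integer> topN(int[] numbers, int n) {
        // Ascending TreeMap: firstKey() is always the smallest number kept so far
        TreeMap<Integer, String> top = new TreeMap<Integer, String>();
        for (int number : numbers) {
            top.put(number, " "); // the value is unused; only the keys matter
            if (top.size() > n) {
                top.remove(top.firstKey()); // evict the current smallest
            }
        }
        return new ArrayList<Integer>(top.keySet());
    }

    public static void main(String[] args) {
        int[] firstLine = {10, 3, 8, 7, 6, 5, 1, 2, 9, 4};
        System.out.println(topN(firstLine, 5)); // prints [6, 7, 8, 9, 10]
    }
}
```

Because the numbers are used as map keys, a value that occurs several times is kept only once, so this yields the top n distinct values.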
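Reducer phase
There is only one reducer, which aggregates the mappers' output once more and selects the top n from it. Note that a TreeMap sorts in ascending order by default; to output the n largest values in descending order, supply your own Comparator. With a descending Comparator the smallest retained key is the last one, so it is lastKey(), not firstKey(), that must be removed when the map grows beyond n entries.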
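The reducer-side merge can be sketched the same way. The example below combines several local top-n lists in a TreeMap ordered by Collections.reverseOrder() and evicts lastKey(), which is the smallest key under a descending order. As an assumption made purely for illustration, each of the three sample input lines is treated as a separate split whose local top 5 has already been computed. The class name GlobalTopN is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

final class GlobalTopN {

    /** Merges local results and returns the n largest distinct numbers, largest first. */
    static List<Integer> merge(List<List<Integer>> localResults, int n) {
        // Descending TreeMap: firstKey() is the largest key, lastKey() the smallest
        TreeMap<Integer, String> top =
                new TreeMap<Integer, String>(Collections.<Integer>reverseOrder());
        for (List<Integer> local : localResults) {
            for (Integer value : local) {
                top.put(value, " ");
                if (top.size() > n) {
                    top.remove(top.lastKey()); // evict the current smallest
                }
            }
        }
        return new ArrayList<Integer>(top.keySet());
    }

    /** Local top 5 of each sample line, assuming one split per line. */
    static List<List<Integer>> sampleLocalResults() {
        List<List<Integer>> locals = new ArrayList<List<Integer>>();
        locals.add(Arrays.asList(6, 7, 8, 9, 10));      // from "10 3 8 7 6 5 1 2 9 4"
        locals.add(Arrays.asList(12, 14, 15, 17, 20));  // from "11 12 17 14 15 20"
        locals.add(Arrays.asList(13, 16, 18, 19));      // from "19 18 13 16"
        return locals;
    }

    public static void main(String[] args) {
        System.out.println(merge(sampleLocalResults(), 5)); // prints [20, 19, 18, 17, 16]
    }
}
```

Evicting firstKey() here would instead discard the largest value each time, which is why the eviction call has to change together with the sort order.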
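3.3. Program Code
TopNMapper:

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNMapper extends Mapper<LongWritable, Text, NullWritable, IntWritable> {

    // Ascending TreeMap: firstKey() is the smallest number kept so far.
    // Keys are unique, so a number that occurs several times is kept once.
    private final TreeMap<Integer, String> repToRecordMap = new TreeMap<Integer, String>();

    @Override
    public void map(LongWritable key, Text value, Context context) {
        String line = value.toString();
        String[] nums = line.trim().split("\\s+");
        for (String num : nums) {
            if (num.isEmpty()) {
                continue; // skip blank lines
            }
            repToRecordMap.put(Integer.parseInt(num), " ");
            // Keep only the 5 largest numbers seen in this split
            if (repToRecordMap.size() > 5) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
    }

    // Runs once after all records of the split have been mapped
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Integer i : repToRecordMap.keySet()) {
            context.write(NullWritable.get(), new IntWritable(i));
        }
    }
}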

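TopNReducer:

import java.io.IOException;
import java.util.Comparator;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<NullWritable, IntWritable, NullWritable, IntWritable> {

    private final TreeMap<Integer, String> repToRecordMap =
            new TreeMap<Integer, String>(new Comparator<Integer>() {
        /*
         * compare(a, b) returns a negative number if a sorts before b,
         * zero if they are equal, and a positive number if a sorts after b.
         * Comparing b with a reverses the natural order, so larger numbers come first.
         */
        @Override
        public int compare(Integer a, Integer b) {
            return Integer.compare(b, a); // avoids the overflow risk of b - a
        }
    });

    @Override
    public void reduce(NullWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            repToRecordMap.put(value.get(), " ");
            if (repToRecordMap.size() > 5) {
                // In descending order the smallest key is the last one
                repToRecordMap.remove(repToRecordMap.lastKey());
            }
        }
        // Keys iterate from largest to smallest
        for (Integer i : repToRecordMap.keySet()) {
            context.write(NullWritable.get(), new IntWritable(i));
        }
    }
}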

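TopNRunner:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopNRunner {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopNRunner.class);

        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);
        job.setNumReduceTasks(1); // a global top N needs exactly one reducer

        job.setMapOutputKeyClass(NullWritable.class);  // key type of the map output
        job.setMapOutputValueClass(IntWritable.class); // value type of the map output
        job.setOutputKeyClass(NullWritable.class);     // key type of the reduce output
        job.setOutputValueClass(IntWritable.class);    // value type of the reduce output

        FileInputFormat.setInputPaths(job, new Path("D:\\topN\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\topN\\output"));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}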
