Learning Big Data Technology with Hadoop (4): MapReduce

Table of contents

1. Overview of MapReduce

1. The core idea of MapReduce

2. MapReduce programming model

2. Working principle of MapReduce

1. Sharding and formatting data sources

2. Execute MapTask

3. Execute the Shuffle process

4. Execute ReduceTask

5. Write to file

3. Case

1. Word frequency statistics

(1) InputFormat component

(2) Mapper component

(3) Reducer component

(4) Combiner component

(5) Operation mode of MapReduce

2. Inverted index

(1) Introduction

(2) Implementation of the Map stage

(3) Combine stage implementation

(4) Implementation of the Reduce phase

(5) Driver program main class implementation

(6) Result display

3. Data deduplication

(1) Implementation of the Map stage

(2) Implementation of the Reduce phase

(3) Driver program main class implementation

(4) Results display

4. TopN

(1) Case introduction

(2) Implementation of the Map stage

(3) Implementation of the Reduce phase

(4) Driver program main class implementation

(5) Result display


The code for these cases has been uploaded to a Baidu Netdisk share; you can download it if needed.

Link: https://pan.baidu.com/s/1Vcqn7-A5YWOMqhBLpr3I0A?pwd=759w (extraction code: 759w)

1. Overview of MapReduce

1. The core idea of MapReduce

        The core idea of MapReduce is "divide and conquer": a task is decomposed into multiple subtasks that have no necessary interdependence and can all be executed independently; finally, the results of these subtasks are aggregated and merged into the overall result.

2. MapReduce programming model

        As a programming model, MapReduce specializes in parallel processing of large-scale data. The model borrows ideas from functional programming: a program is implemented through the map() function and the reduce() function. When MapReduce processes a computing task, the task is divided into two phases, the Map phase and the Reduce phase.

(1) Map stage: preprocesses the raw data.

(2) Reduce stage: summarizes the processing results of the Map stage to produce the final result.

        Process description: First, the original data is converted into key-value pairs <k1, v1>. Second, the converted key-value pairs <k1, v1> are fed into the map() function, which maps each <k1, v1> to a series of intermediate key-value pairs <k2, v2> according to the mapping rules. Third, the intermediate key-value pairs are grouped into the form <k2, {v2, ...}> and passed to the reduce() function, which merges the values belonging to the same key and generates new key-value pairs <k3, v3>; these <k3, v3> pairs are the final result.
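
        For example, for word frequency statistics over an assumed input line "hello world hello" (sample data used only for illustration), the key-value pairs flow through the two phases roughly as follows:

<0, "hello world hello">                              // <k1,v1>: line offset and line content
-> map()    -> <hello,1> <world,1> <hello,1>          // <k2,v2>: intermediate results
-> group    -> <hello,{1,1}> <world,{1}>              // <k2,{v2...}>: values grouped by key
-> reduce() -> <hello,2> <world,1>                    // <k3,v3>: final results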

2. Working principle of MapReduce

1. Sharding and formatting data sources

        The data source that is input to the Map stage first needs to be sharded (split) and formatted.

(1) Sharding operation: the source file is divided into data blocks (input splits) of equal size; Hadoop then builds one Map task for each split, and that task runs the user-defined map() function to process every record in the split.

(2) Formatting operation: each split is formatted into key-value pairs of the form <key, value>, where key is the byte offset of a line within the file and value is the content of that line.
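
        For example, with the default TextInputFormat, a split containing the two sample lines below (assumed content, for illustration only) is formatted into:

hello world      ->  <0, "hello world">
hello hadoop     ->  <12, "hello hadoop">

Here 12 is the byte offset at which the second line starts.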

2. Execute MapTask

(1) Read stage: Map Task parses each key-value pair <k,v> from the input InputSplit through the RecordReader written by the user.

(2) Map stage: The parsed <k, v> is handed over to the map function written by the user for processing, and a new key-value pair <k, v> is generated.

(3) Collect stage: in the user-written map() function, when a record has been processed, OutputCollector.collect() is generally called to output the result; inside this call the generated key-value pairs <k, v> are partitioned and written into a circular in-memory buffer.

(4) Spill stage: when the circular buffer becomes full, MapReduce writes the data to the local disk and generates a temporary file. Note that before the data is written to disk, it is sorted once, and merged and compressed if necessary.

(5) Combine stage: after all data has been processed, the MapTask merges all temporary files once to ensure that only one data file is generated in the end.

3. Execute the Shuffle process

        Shuffle distributes the processing results output by the MapTasks to the ReduceTasks; during distribution, the data is partitioned and sorted by key.

4. Execute ReduceTask

(1) Copy stage: the ReduceTask remotely copies a piece of data from each MapTask; if the size of a piece of data exceeds a certain threshold, it is written to disk, otherwise it is kept in memory.

(2) Merge stage: while copying data remotely, the ReduceTask starts two background threads that merge files in memory and on disk respectively, to prevent excessive memory usage or too many disk files.

(3) Sort stage: the user-written reduce() method takes as input a group of data aggregated by key, so data with the same key must be brought together. Hadoop adopts a sort-based strategy: since each MapTask has already partially sorted its own output, the ReduceTask only needs to perform one merge sort over all the data.

(4) Reduce phase: the reduce() method is called on the sorted key-value pairs, once for each group of key-value pairs with equal keys; each call produces zero or more key-value pairs, which are finally written to HDFS.

(5) Write stage: the reduce() function writes the calculation results into HDFS.

5. Write to file

        The MapReduce framework automatically passes the <key, value> pairs generated by the ReduceTask to the write() method provided by OutputFormat to carry out the file writing operation.

3. Case

1. Word frequency statistics

        Here we use the word frequency statistics (word count) case to get a brief understanding of the main MapReduce components.

(1) InputFormat component

        This component is mainly used to describe the format of input data, and it provides the following two functions.

a. Data splitting: according to a certain strategy, the input data is divided into several splits, which determines the number of MapTasks and the split each one processes.

b. Providing the input data for the Mapper: given a split, it is parsed into key-value pairs <k, v>.

        Hadoop ships with an abstract InputFormat class (in the org.apache.hadoop.mapreduce package), which is defined as follows.

public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}

        The InputFormat class declares two abstract methods: getSplits() and createRecordReader(). The getSplits() method is responsible for dividing the input into splits, and the createRecordReader() method creates a RecordReader object that reads data from a given split.
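
        In practice we rarely implement InputFormat ourselves: the default TextInputFormat already splits text files and parses them into <offset, line> pairs. As a minimal sketch (assuming an existing Job instance named job, as in the driver classes later in this article), a concrete InputFormat can also be selected explicitly:

// Optional: explicitly choose the input format (TextInputFormat is already the default for text input)
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
...
job.setInputFormatClass(TextInputFormat.class);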

(2) Mapper component

        A MapReduce program generates multiple map tasks according to the input files. The Mapper class is the base class for Map tasks; it provides a map() method that by default does no processing. To customize the behavior, we inherit the Mapper class and override the map() method.

        Next, let's take word frequency statistics as an example and customize the map() method. The code is as follows.

package cn.itcast.mr.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Receive one line of input text and convert it to a String
        String line = value.toString();

        // Split the line into words on the space delimiter and store them in a String array
        String[] words = line.split(" ");

        // Iterate over the array and produce <K2,V2> key-value pairs of the form <word, 1>
        for (String word : words) {
            // Use the context to send the data processed in the Map phase to the Reduce phase as input
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

(3) Reducer component

        The key-value pairs output by the Map process will be merged by the Reducer component. Here we take word frequency statistics as an example, and customize the reduce() method.

package cn.itcast.mr.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Define a counter
        int count = 0;

        // Iterate over the group of values and accumulate each 1 to get the total number of occurrences of the word
        for (IntWritable iw : values) {
            count += iw.get();
        }

        // Write <k3,v3> to the context
        context.write(key, new IntWritable(count));
    }
}

(4) Combiner component

        This component performs a local merge on the duplicate data output by the Map stage, and the new key-value pairs it produces are then used as the input of the Reduce stage. To customize a Combiner class, inherit the Reducer class and override the reduce() method. The specific code is as follows.

package cn.itcast.mr.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Local aggregation: define a counter
        int count = 0;

        // Iterate over the group of values and accumulate each 1 to get the total number of occurrences of the word
        for (IntWritable v : values) {
            count += v.get();
        }

        // Write <k3,v3> to the context
        context.write(key, new IntWritable(count));
    }
}

(5) Operation mode of MapReduce

        MapReduce has two running modes: local mode and cluster mode.

a. Local mode: the MapReduce runtime environment is simulated inside the current development environment, and both the data to be processed and the output results are on the local file system.

b. Cluster mode: the MapReduce program is packaged into a jar, uploaded to the YARN cluster to run, and both the data and the results are on HDFS; a submission command is sketched below.
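
        For example, after packaging the word count program into a jar (the jar name below is only an assumption for illustration), a cluster-mode submission looks roughly like this:

hadoop jar wordcount.jar cn.itcast.mr.wordcount.WordCountDriver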

        Here we mainly cover the local running mode. To run locally we also need a Driver class; the code is as follows.

package cn.itcast.mr.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Use a Job to encapsulate the information for this MR run
        Configuration conf = new Configuration();
        // Configure the MR run mode; "local" means local mode and can be omitted
        conf.set("mapreduce.framework.name", "local");
        // Get a Job instance
        Job wcjob = Job.getInstance(conf);
        // Specify the main class of the MR Job jar
        wcjob.setJarByClass(WordCountDriver.class);
        // Specify the Mapper, Combiner and Reducer classes for this MR job
        wcjob.setMapperClass(WordCountMapper.class);
        wcjob.setCombinerClass(WordCountCombiner.class); // Omitting the Combiner does not change the result
        wcjob.setReducerClass(WordCountReducer.class);
        // Set the output key and value data types of the Mapper class
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(IntWritable.class);

        // Set the output key and value data types of the Reducer class
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(IntWritable.class);

        // In local mode, specify the location of the data to be processed
        FileInputFormat.setInputPaths(wcjob, "/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/WordCount/input");
        // In local mode, specify where the results are saved after processing
        FileOutputFormat.setOutputPath(wcjob, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/WordCount/output"));

        // Submit the job and monitor/print its progress
        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

         When we finish running, a result file will be generated locally.
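
        As a simple sketch, if the input directory contained a single file with the line "hello world hello" (assumed sample data, not the actual test file), the generated part-r-00000 result file would contain:

hello	2
world	1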

2. Inverted index

(1) Introduction

        An inverted index is a data structure commonly used in document retrieval systems and widely used in full-text search engines. It can be simply understood as finding documents by content, rather than finding content by document.
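
        For example, suppose (purely for illustration) that file1 contains "MapReduce is simple", file2 contains "MapReduce is powerful is simple", and file3 contains "Hello MapReduce bye MapReduce". The inverted index entry for the word MapReduce then records, for each file, how many times the word appears:

MapReduce	file1:1;file2:1;file3:2;

This is exactly the output format produced by the Reduce stage below.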

(2) Implementation of the Map stage

package cn.itcast.mr.invertedIndex;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Stores the word combined with the document name
    private static Text keyInfo = new Text();

    // Stores the word count, initialized to "1"
    private static final Text valueInfo = new Text("1");

    /*
     * This method converts K1, V1 into K2, V2
     * key: K1, the line offset
     * value: V1, the text of the line
     * context: the context object
     * Example output: <MapReduce:file3 "1">
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();

        // Get the array of words
        String[] fields = line.split(" ");

        // Get the file split that this line belongs to
        FileSplit fileSplit = (FileSplit) context.getInputSplit();

        // Get the file name from the file split
        String filename = fileSplit.getPath().getName();

        for (String field : fields) {
            // The key consists of the word and the file name, e.g. "MapReduce:file1"
            keyInfo.set(field + ":" + filename);
            context.write(keyInfo, valueInfo);
        }
    }
}

(3) Combine stage implementation

package cn.itcast.mr.invertedIndex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
    private static Text info = new Text();

    // Input:  <MapReduce:file3 {1,1,...}>
    // Output: <MapReduce file3:2>
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int sum = 0;  // word-frequency counter
        // Iterate over the group of values and accumulate each 1 to get the total number of occurrences
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        int splitIndex = key.toString().indexOf(":");
        // Reset the value so that it consists of the file name and the word count
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        // Reset the key to the word itself
        key.set(key.toString().substring(0, splitIndex));

        // Write <k3,v3> to the context
        context.write(key, info);
    }
}

(4) Implementation of the Reduce phase

package cn.itcast.mr.invertedIndex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    private static Text result = new Text();

    // Input:  <MapReduce, file3:2>
    // Output: <MapReduce, file1:1;file2:1;file3:2;>
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Build the document list
        StringBuffer fileList = new StringBuffer();
        for (Text value : values) {
            fileList.append(value.toString() + ";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}

(5) Driver program main class implementation

package cn.itcast.mr.invertedIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class InvertedIndexDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Use a Job to encapsulate the information for this MR run
        Configuration conf = new Configuration();
        // Get a Job instance
        Job job = Job.getInstance(conf);
        // Specify the main class of the MR Job jar
        job.setJarByClass(InvertedIndexDriver.class);
        // Specify the Mapper, Combiner and Reducer classes for this MR job
        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);
        // Set the output key and value data types of the Mapper class
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Set the output key and value data types of the Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // In local mode, specify the location of the data to be processed
        FileInputFormat.setInputPaths(job, "/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/InvertedIndex/input");
        // In local mode, specify where the results are saved after processing
        FileOutputFormat.setOutputPath(job, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/InvertedIndex/output"));

        // Submit the job and monitor/print its progress
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0:1);
    }
}

(6) Result display

3. Data deduplication
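
        Data deduplication relies on the fact that the MapReduce framework groups records by key: the Mapper emits each input line as the key with a NullWritable placeholder value, so identical lines end up in the same group, and the Reducer simply writes each distinct key once.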

(1) Implementation of the Map stage

package cn.itcast.mr.dedup;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static Text field = new Text();
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context)
            throws IOException, InterruptedException {
        // Use the whole input line as the output key; duplicate lines then share the same key
        field = value;
        context.write(field, NullWritable.get());
    }
}

(2) Implementation of the Reduce phase

package cn.itcast.mr.dedup;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DeupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context)
            throws IOException, InterruptedException {
        // Each distinct key (line) is written exactly once, which removes the duplicates
        context.write(key, NullWritable.get());
    }
}

(3) Driver program main class implementation

package cn.itcast.mr.dedup;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException{
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DedupDriver.class);
        job.setMapperClass(DedupMapper.class);
        job.setReducerClass(DeupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/Dedup/input"));
        FileOutputFormat.setOutputPath(job, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/Dedup/output"));

        boolean res = job.waitForCompletion(true);
        if (res) {
            FileReader fr = new FileReader("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/Dedup/output/part-r-00000");
            BufferedReader reader= new BufferedReader(fr);
            String str;
            while ( (str = reader.readLine()) != null )
                System.out.println(str);

            System.out.println("运行成功");
        }
        System.exit(res ? 0 : 1);
    }
}

(4) Results display

4. TopN

(1) Case introduction

        The TopN analysis method sorts the objects under study by a certain indicator in descending or ascending order, takes the top N records, and focuses the analysis on those N records.
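
        As a sketch, assume the input consists of the numbers 10 3 8 7 6 5 1 2 9 4 (made-up sample data); a Top-5 job over this input outputs 10, 9, 8, 7 and 6 in descending order. The implementation below keeps a TreeMap of at most 5 entries in both the Mapper and the Reducer.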

(2) Implementation of the Map stage

package cn.itcast.mr.topN;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.TreeMap;

public class TopNMapper extends Mapper<LongWritable, Text,
        NullWritable, IntWritable> {
    // TreeMap keeps its keys in ascending order; holds the 5 largest numbers seen by this MapTask
    private TreeMap<Integer, String> repToRecordMap =
            new TreeMap<Integer, String>();
    @Override
    public void map(LongWritable key, Text value, Context context) {
        String line = value.toString();
        String[] nums = line.split(" ");
        for (String num : nums) {
            repToRecordMap.put(Integer.parseInt(num), " ");
            // Once more than 5 entries are held, drop the smallest key so only the 5 largest remain
            if (repToRecordMap.size() > 5) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // cleanup() runs once after this MapTask has processed all of its input; emit the local top 5
        for (Integer i : repToRecordMap.keySet()) {
            try {
                context.write(NullWritable.get(), new IntWritable(i));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

(3) Implementation of the Reduce phase

package cn.itcast.mr.topN;


import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Comparator;
import java.util.TreeMap;

public class TopNReducer extends Reducer<NullWritable, IntWritable, NullWritable, IntWritable> {
    // TreeMap with a descending comparator, so keys are ordered from largest to smallest
    private TreeMap<Integer, String> repToRecordMap =
            new TreeMap<Integer, String>(new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
            return b - a;
        }
    });

    public void reduce(NullWritable key,
                       Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
        for (IntWritable value : values) {
            repToRecordMap.put(value.get(), " ");
            // Keep only the 5 largest values: with descending order, lastKey() is the smallest
            if (repToRecordMap.size() > 5) {
                repToRecordMap.remove(repToRecordMap.lastKey());
            }
        }
        for (Integer i : repToRecordMap.keySet()) {
            context.write(NullWritable.get(), new IntWritable(i));
        }
    }
}

(4) Driver program main class implementation

package cn.itcast.mr.topN;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.BufferedReader;
import java.io.FileReader;

public class TopNDriver {
    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopNDriver.class);
        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/TopN/input"));
        FileOutputFormat.setOutputPath(job, new Path("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/TopN/output"));
        boolean res = job.waitForCompletion(true);
        if (res) {
            FileReader fr = new FileReader("/home/huanganchi/Hadoop/实训项目/HadoopDemo/textHadoop/TopN/output/part-r-00000");
            BufferedReader reader= new BufferedReader(fr);
            String str;
            while ( (str = reader.readLine()) != null )
                System.out.println(str);

            System.out.println("运行成功");
        }

        System.exit(res ? 0 : 1);
    }
}

(5) Result display


Reference book

"Hadoop Big Data Technology: Principles and Applications"

 
