The Second Core Component of Hadoop: The MapReduce Framework, Part 3

9. The core stages of an MR program in detail

1. The stages of an MR program and what each one does

InputFormat stage: two functions

  • Slices the input data. The number of slices determines the number of MapTasks in the Mapper stage.
  • Defines how a MapTask turns the slice it reads into key-value data, including the key and value data types.

Mapper stage

  • Runs the calculation logic for the data of each slice.
  • The map method is called once per key-value pair. With the commonly used implementation classes, one key-value pair corresponds to one line of data.

Partitioner stage

  • Assigns each key-value pair output by the map stage to a partition when it is written to the buffer, before the buffer is spilled to disk.

WritableComparable

  • Sorts the key-value data output by the map.
  • Three sorts occur while an MR program runs; each has its own execution timing and sorting algorithm.

Combiner stage: optional

  • Equivalent to a Reducer, except that it only acts on the output of the current MapTask.
  • Partially summarizes the data output in the map stage. Not every MR program can add a Combiner: adding one must not change the result of the original MR program.

WritableComparator component (optional)

  • Handles grouping (auxiliary sorting). After the reduce side pulls its data back, it groups the data by key. Which keys count as the same group can be decided through auxiliary sorting (grouping sorting).

Reducer stage

  • Aggregates the data of all MapTasks and applies the calculation logic to the aggregated data.
  • Reduce performs a global summary of the map output: values with the same key are grouped together, and the reduce method is called once per group of identical keys.
  • The number of ReduceTasks can be set manually. When setting it, pay attention to its relationship with the number of partitions.

OutputFormat stage

  • Defines how the key-value data produced by the MR program is written to its final destination.

2. The first component an MR program runs: InputFormat

InputFormat is an abstract class that provides two abstract methods

  • getSplits: calculates the slices of the input data files.

  • createRecordReader: determines whether a MapTask reads its slice line by line or by some other rule, what the key and value mean and which types they have, and how the data that is read is converted into key-value format.

Commonly used implementation class of InputFormat: FileInputFormat (the base class of the file-based input formats)

  • FileInputFormat is the input format class used for reading file data, but FileInputFormat is itself an abstract class.

  • The FileInputFormat abstract class has five commonly used non-abstract subclasses

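    • TextInputFormat (the default input format of an MR job)

      • How to slice (slicing mechanism):
      Two core parameters: MinSplitSize = 1L   MaxSplitSize = Long.MAX_VALUE
      configuration.setLong("mapreduce.input.fileinputformat.split.minsize", xxxL)
      configuration.setLong("mapreduce.input.fileinputformat.split.maxsize", xxxL)
      
      Each input file is sliced separately: with N input files there are at least N slices.
      For each file, the blockSize is obtained first and then the slice size is computed: splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
      The file is first checked for whether it can be split. A file compressed in a non-splittable format (for example .gz) becomes a single slice. If the file can be split, its length is compared with 1.1 times splitSize: if it is not larger, the whole file becomes one slice; if it is larger, one slice of splitSize is cut off and the remaining length is compared with splitSize again in the same way.
          
      Example:
      Case 1:  a.tar.gz    300M   blocksize 128M    only one slice of 300M
      Case 2: blocksize is 128M for both files
      a.txt   200M      two slices: the first is 128M, the second is 72M
      b.txt   130M      one slice: 130M
          
      [Note]
          TextInputFormat slices by SplitSize, and by default SplitSize = the BlockSize of the file.
          To make SplitSize larger than blockSize, increase minSize in the MR program.
          To make SplitSize smaller than blockSize, decrease maxSize in the MR program.
      
      • How to read data into key-value (reading mechanism of kv data):
      TextInputFormat reads the slice line by line. For each line, the key is the offset of the line and the value is the content of the line.
      
      The offset of a line is the position of its first byte in the file. It is a non-negative integer, so the key is a LongWritable; the value is the text of the line, so it is a Text.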
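The slicing rules above can be checked with a small stand-alone sketch. It is not Hadoop code: the class and method names are hypothetical, and it only reproduces the splitSize formula and the 1.1 rule for a single splittable file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the slicing rules described in the text; not part of Hadoop.
final class SplitSizeSketch {
    // The "1.1 times" rule: keep cutting only while the remainder exceeds 1.1 x splitSize
    private static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the sizes (in bytes) of the slices planned for one splittable file
    static List<Long> planSplits(long fileLength, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits.add(remaining); // the tail becomes the last slice
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println("a.txt slices: " + planSplits(200 * mb, splitSize).size());
        System.out.println("b.txt slices: " + planSplits(130 * mb, splitSize).size());
    }
}
```

With the default minSize and maxSize the split size equals the block size, so the 200M file yields two slices (128M and 72M) and the 130M file stays in one slice, matching the example.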
      

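    • KeyValueTextInputFormat

      • Slicing mechanism: exactly the same as TextInputFormat's slicing mechanism
      • k-v data reading mechanism:
      Data is read line by line. Each line is split at the first occurrence of the configured separator: the part before it becomes the key and the rest of the line becomes the value.
      Under this mechanism both key and value are of type Text.
      
      When using KeyValueTextInputFormat, a separator should be specified; if none is specified, the default separator is \t
      conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "<separator>")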
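The key-value reading rule can be illustrated without Hadoop. The sketch below is a hypothetical stand-in for the record reader: it splits a line at the first separator. Treating a line without any separator as "whole line is the key, value is empty" is an assumption of this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration of the KeyValueTextInputFormat reading rule; not Hadoop code.
final class KeyValueLineSketch {
    // Splits a line at the FIRST occurrence of the separator.
    // Assumption: a line without the separator becomes the key, with an empty value.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("hello\tworld\tspark", '\t');
        System.out.println("key=" + kv.getKey() + ", value=" + kv.getValue());
    }
}
```

Only the first separator is significant, so any later separators stay inside the value.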
      

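    • NLineInputFormat

      • Slicing mechanism
      Slices are not planned from the file size and splitSize but from the number of lines in the input files. Each file is sliced separately.
      
      When using NLineInputFormat, the number of lines per slice needs to be specified
      NLineInputFormat.setNumLinesPerSplit(job, 3)
      
      3 files, 3 lines per slice
      file    lines    slices
      a.txt   10       4
      b.txt   12       4
      c.txt   10       4
      
      • KV data reading mechanism:
      Exactly the same as TextInputFormat
      key: LongWritable, the offset of each line
      value: Text, the content of each line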
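The slice counts in the NLineInputFormat example follow from a ceiling division per file. A minimal sketch, with a hypothetical class name, assuming every file is sliced on its own:

```java
// Hypothetical helper: number of slices NLineInputFormat-style slicing yields for one file.
final class NLineSplitSketch {
    // Ceiling of lineCount / linesPerSplit, computed with integer arithmetic
    static int splitsForFile(int lineCount, int linesPerSplit) {
        return (lineCount + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        int[] lineCounts = {10, 12, 10}; // a.txt, b.txt, c.txt
        int total = 0;
        for (int lines : lineCounts) {
            total += splitsForFile(lines, 3);
        }
        System.out.println("total slices: " + total); // 4 + 4 + 4 = 12
    }
}
```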
      

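    • CombineTextInputFormat: relatively frequently used

      • Slicing mechanism
      Suited to scenarios with a large number of small files
      
      # Background
      TextInputFormat, KeyValueTextInputFormat and NLineInputFormat all slice each file separately, which means that n input files produce at least n slices.
      If the input is a pile of small files of a few hundred KB each, this slicing mechanism produces many small slices of at most a few hundred KB, and N MapTasks still have to be started to process them. That wastes resources, and in big data resources are precious.
      As a rule of thumb, a MapTask in an MR program should process a slice of a few hundred MB so that resources are not wasted.
      CombineTextInputFormat plans slices for large numbers of small files: instead of slicing each file separately, it slices by capacity, so one slice may contain many small files.
      
      # Slicing rules
      Before CombineTextInputFormat slices, a capacity must be specified: the virtual slice capacity (it can be thought of as the slice capacity)
      CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // in bytes (4M)
      During slicing, each file is first cut into virtual slices by this capacity, as follows. If the file is not larger than the virtual slice capacity, it becomes one virtual slice. If the file is larger than the capacity but not larger than twice the capacity, it is divided evenly into two virtual slices. If the file is larger than twice the capacity, one slice of the capacity size is cut off and the remainder goes through the same checks again.
      Suppose the virtual slice capacity is 4M
      file     size     virtual slices
      a.txt    4.1M     2.05M 2.05M
      b.txt    3M       3M
      c.txt    10M      4M 3M 3M
      After virtual slicing, the virtual slices are accumulated in order. As soon as the accumulated size is greater than or equal to the configured capacity, the accumulated slices form one physical slice; otherwise the next virtual slice is added, until the total reaches the capacity.
      Virtual slices: 2.05M 2.05M 3M 4M 3M 3M
      Physical slices: 4.1M 7M 6M
      To put all files into one slice, change the virtual slice capacity to 100M
      file     size     virtual slices
      a.txt    4.1M     4.1M
      b.txt    3M       3M
      c.txt    10M      10M
      Virtual slices: 4.1M 3M 10M
      Physical slices: 17.1M
      
      • KV data reading mechanism
      Exactly the same as TextInputFormat
      key: LongWritable, the offset of each line
      value: Text, the content of each line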
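The two-step rule (virtual slices first, then accumulation into physical slices) can be traced with the following sketch. It is a simplified model with hypothetical names, not the CombineTextInputFormat source; sizes are plain numbers in whatever unit the caller picks, and a file of exactly twice the capacity is treated as "split evenly".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the slicing rules described in the text; not Hadoop code.
final class CombineSplitSketch {
    // Step 1: cut one file into virtual slices
    static List<Long> virtualSlices(long fileSize, long maxSize) {
        List<Long> slices = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 2 * maxSize) {      // more than twice the capacity: cut one full slice
            slices.add(maxSize);
            remaining -= maxSize;
        }
        if (remaining > maxSize) {             // between 1x and 2x: split evenly in two
            long half = remaining / 2;
            slices.add(half);
            slices.add(remaining - half);
        } else if (remaining > 0) {            // at most the capacity: one virtual slice
            slices.add(remaining);
        }
        return slices;
    }

    // Step 2: accumulate virtual slices in order until the running total reaches maxSize
    static List<Long> physicalSplits(List<Long> fileSizes, long maxSize) {
        List<Long> physical = new ArrayList<>();
        long current = 0;
        for (long fileSize : fileSizes) {
            for (long slice : virtualSlices(fileSize, maxSize)) {
                current += slice;
                if (current >= maxSize) {
                    physical.add(current);
                    current = 0;
                }
            }
        }
        if (current > 0) {
            physical.add(current);             // leftover that never reached the capacity
        }
        return physical;
    }

    public static void main(String[] args) {
        // Sizes in units of 0.01M: 4.1M, 3M, 10M with a 4M capacity
        List<Long> files = List.of(410L, 300L, 1000L);
        System.out.println(physicalSplits(files, 400L)); // [410, 700, 600]
    }
}
```

Expressing the sizes in units of 0.01M keeps the arithmetic in integers; the result corresponds to the physical slices 4.1M, 7M and 6M from the example.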
      

    • SequenceFileInputFormat: not covered yet

How to customize the InputFormat implementation class

Custom InputFormat (define your own slicing mechanism and KV reading rules)
	1. Define a class that extends InputFormat
	2. Override the getSplits method
	3. Override the createRecordReader method

3. Source code analysis of the job submission process of the MR program

When an MR program runs, it first calculates the slices of its input files (using the specified InputFormat implementation class) and generates the slice planning file job.split. It then generates the configuration file job.xml from the program's Configuration, and finally provides job.xml, job.split and job.jar (the packaged MR program) to the resource scheduler.

So far we have been running our MR programs on Windows rather than in a big data cluster. Running on Windows is not distributed operation in the strict sense: Windows only simulates the MR runtime environment. In this local mode, only the job.split and job.xml files need to be written, and they go to a local directory.

Running on Windows is generally just a test to see whether the code has bugs. Once the code works, the usual way to run an MR program in an enterprise is to package it as a jar, upload it to a node of the Hadoop cluster, and run it with hadoop jar xxx.jar xxx.xxDriver. An MR program started this way runs on YARN and is truly distributed. When running on YARN, the job submits the job.split, job.xml and job.jar files to HDFS.

1. The framework first identifies the runtime environment.
2. A resource submission directory is created and a JobID is generated. In local mode the resources are submitted to a local path; in YARN mode they are submitted to a path on HDFS.
3. The slice planning file job.split is generated by the slicing mechanism of the InputFormat and written to the submission directory.
4. All configuration items of the MR program are written to a job.xml file, which is also written to the submission directory.
5. The program requests resources and runs the map tasks and reduce tasks.

4. The role of the Mapper component in a running MR program

The Mapper stage is a core stage of an MR program. It provides the map method, which is fed by the record reader that the InputFormat's createRecordReader method returns for the corresponding slice. The map method is called once for each key-value pair that is read, and writes its results to an in-memory buffer of the MR framework.

The Mapper stage starts multiple MapTasks, which run in parallel and do not interfere with each other. The number of MapTasks equals the number of slices. Under the default slicing mechanism, a slice corresponds to a block.

MapTasks may run on multiple nodes, and MR has a rule for choosing those nodes: moving computation is cheaper than moving data.
A MapTask is preferably started on the node that stores the slice it is responsible for, so that no data has to be moved for the calculation. If that node has no free resources, the MapTask is started on the node closest to the data (according to the network topology).

[Note] Computing tasks can be started on the node where the slice is stored because, when we configured the Hadoop cluster, the DataNode and the NodeManager were deployed on the same nodes.

5. Shuffle stage during MR program running

Shuffle is at the core of big data distributed computing, and it is also the main factor that affects its performance. Shuffle literally means reshuffling: the data is redistributed and transferred between nodes over the network.

If too much data has to be transferred during the Shuffle, distributed computing becomes inefficient. In MapReduce, the Shuffle mechanism also requires a large amount of disk IO (data is written from memory to disk), and disk IO is another core factor in computing performance. [MapReduce optimization] Improving the efficiency of an MR program basically means optimizing the shuffle stage.

In a MapReduce program, the shuffle takes place after the map method has produced its output and before the reduce method is executed.

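Detailed process of Shuffle stage work

Execution logic of Shuffle phase

  • Logic after execution of Map method
1. When the map method outputs a kv pair, its partition is first computed with the configured Partitioner. Then the partition number, the kv data itself, the start positions of the key and the value in memory, and the lengths of the key and the value are written to an in-memory ring buffer (100M by default).
2. When the ring buffer reaches the configured threshold (80%), its data is spilled to a disk file. Before spilling, the data in the buffer is sorted within each partition (by key, using quicksort by default), and the sorted partition data is then written to the spill file.
3. The map stage may spill several times. Each spill first sorts the buffer by partition and then writes a new spill file.
4. If a Combiner is configured and there are at least 3 spill files (the mapreduce.map.combine.minspills setting, default 3), the Combiner is applied to the spilled data as a map-side aggregation. The Combiner is optional.
5. When the map stage finishes, the spill files and the data still in the ring buffer are merged into one large file. The merge also sorts the data (again per partition and by key, using merge sort).
6. After the merged file is produced, the configured Combiner runs once more to partially summarize the map output.
[Note] The Combiner is an optional component. Steps 4 and 6 can only happen if a Combiner is configured; without one they never run.
Even when a Combiner is specified, it may not run at all: Hadoop does not guarantee how many times a Combiner is called, so the MR program must produce the same result with zero, one or several Combiner runs.
  • The logic before the Reduce method is executed
1. Copy phase: each ReduceTask pulls the data of the partition it is responsible for from the different MapTasks into its memory. If the data does not fit in memory, it is written to files.
2. Merge phase: the data pulled from the different MapTasks is merged as a whole.
3. Sort phase: while merging, the data is sorted again. A separate rule can be specified for this sort; if none is specified, the key comparison rule is used. The algorithm is again merge sort.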

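Source code walkthrough: how map output enters the Shuffle

1. The collector writes data to the ring buffer, but first it computes the partition of the data with the Partitioner. By default there are two ways the partition is computed:
	1. If the number of ReduceTasks equals 1, an internal partitioner class is used that assigns all data to partition 0.
	2. If the number of ReduceTasks is greater than 1, HashPartitioner is used: the key's hashCode is combined with Integer.MAX_VALUE by a bitwise & and the result is taken modulo (%) the number of ReduceTasks to get the partition number.
The range of partition numbers depends on the number of ReduceTasks: [0, numReduceTasks - 1]
	To control which partition the data goes to yourself, define a custom Partitioner.

2. The collector writes the data into the ring buffer. In the code the ring buffer is a byte array of 100M by default; once more than 80M is used, the buffered data must be written to a file.
Both the buffer size and the threshold can be configured:
mapreduce.task.io.sort.mb	100	    size of the ring buffer in MB, default 100
mapreduce.map.sort.spill.percent	0.80   spill threshold of the buffer, default 0.8
They can also be set in mapred-site.xml. In that case every MR program that later runs on the Hadoop cluster uses the values from the configuration file, so we do not recommend this.

Different programs need different buffer sizes and thresholds, so these values are usually set through the Configuration in the MR driver. The setting then only applies to the current MR program, and this is the most common approach.

Rule of thumb: the larger the buffer, the fewer the spills and the faster the computation.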

Partitioning of MapTask output

When a MapTask outputs data, it first computes the partition of each kv pair. The kv pair and its partition number are then written to the ring buffer through the collector.

The mechanism for calculating partitions:

  • If the number of partitions (internally, NumReduceTasks) equals 1: the framework uses an anonymous inner class of Partitioner whose calculation logic simply returns 0.

  • If the number of partitions is greater than 1: the framework uses the partitioner class set in the Driver. If none is set, the HashPartitioner class is used by default.

The relationship between the number of partitions and NumReduceTasks:

  • Each ReduceTask processes the data of exactly one partition, so in principle the number of ReduceTasks and the number of partitions should be the same.
  • If they are not the same, one of the following three situations occurs:
    • The custom partitioner defines more than 1 partition, but the number of ReduceTasks is 1. The program runs normally: the framework uses its built-in anonymous inner class partitioner, all data goes to partition 0, and the custom partitioner class has no effect.
    • The custom partitioner defines more than 1 partition, and the number of ReduceTasks is greater than 1 but less than the number of partitions. The program reports an error at runtime.
    • The custom partitioner defines more than 1 partition, and the number of ReduceTasks is greater than the number of partitions. The program runs normally with the custom partitioning, but the surplus ReduceTasks have no data to process.
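The default partition arithmetic and the three situations above can be expressed in a few lines. This is a stand-alone sketch with hypothetical names; it models the rule "a partition number is only legal in [0, numReduceTasks - 1]" rather than calling Hadoop.

```java
// Hypothetical model of the default partitioning rule; not Hadoop code.
final class PartitionSketch {
    // Hash partitioning as described: (hashCode & Integer.MAX_VALUE) % numReduceTasks.
    // The & clears the sign bit, so a negative hashCode cannot yield a negative partition.
    static int defaultPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // A partition number returned by a partitioner is legal only inside [0, numReduceTasks - 1]
    static boolean isLegalPartition(int partition, int numReduceTasks) {
        return partition >= 0 && partition < numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(defaultPartition("hello", 3));
        // A custom partitioner returning partition 2 with only 2 ReduceTasks is the error case
        System.out.println(isLegalPartition(2, 2)); // false
        // With 3 ReduceTasks and partitions 0..1 in use, task 2 simply receives no data
        System.out.println(isLegalPartition(1, 3)); // true
    }
}
```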

In the Shuffle stage, data partition rules are defined through a custom partitioner.

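1. Define a Java class that extends the Partitioner class
2. Override the getPartition method of Partitioner to define the partitioning rule
The number of partitions should match the number of ReduceTasks. If they differ, one of the following three situations occurs:
	1. More ReduceTasks than partitions: several result files are produced, but some are empty, because the surplus ReduceTasks have no partition data to process.
	2. Fewer ReduceTasks than partitions, but more than 1: an error is reported.
	3. Fewer ReduceTasks than partitions, and equal to 1: the job runs normally, but the custom partitioning is not applied.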
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Random;

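/**
 * Word count implemented with MapReduce. Requirements:
 *      1. Count the total number of occurrences of each word in the input file.
 *      2. Produce two output files: results for words whose first letter is upper case go to part-r-00000,
 *          results for words whose first letter is lower case go to part-r-00001.
 *
 * Implementation:
 *      Two result files are needed, so two ReduceTasks are needed (by default each ReduceTask writes one file).
 *      The rule for distributing the data is fixed as well, and the default partitioning cannot satisfy it,
 *      so a custom partitioner is needed.
 *      The rest is the basic word count code.
 */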
public class WCDriver02 {
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
        Configuration configuration = new Configuration();
        //configuration.set("fs.defaultFS","hdfs://192.168.31.104:9000");

        Job job = Job.getInstance(configuration);
        // Tells Hadoop which JAR holds the job classes, avoiding ClassNotFoundException when the packaged JAR runs on the cluster
        job.setJarByClass(WCReducer02.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/wordcount.txt"));

        job.setMapperClass(WCMapper02.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Register the custom partitioning mechanism (Partitioner)
        job.setPartitionerClass(WCPartitioner.class);

        job.setReducerClass(WCReducer02.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        /*
         * Set the number of ReduceTasks. It should match the number of partitions of the custom partitioner.
         * The rule can be broken, but at a cost. If the two numbers differ, one of three things happens:
         * 1. More ReduceTasks than partitions: several result files are produced, some of them empty,
         *    because the surplus ReduceTasks have no partition data to process
         * 2. Fewer ReduceTasks than partitions, but more than 1: an error is reported
         * 3. Fewer ReduceTasks than partitions, and equal to 1: the job runs, but the partitioning is not applied
         */
        job.setNumReduceTasks(2);

        //job.setOutputFormatClass();
        Path path = new Path("/output");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        // Delete a previous output directory, because the job fails if the output path already exists
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
class WCMapper02 extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // Skip empty tokens (blank lines, repeated spaces) so the partitioner never receives an empty key
            if (word.isEmpty()) {
                continue;
            }
            context.write(new Text(word), new LongWritable(1L));
        }
    }
}

class WCReducer02 extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // Sum the counts of one word
        long sum = 0L;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

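/**
 * Custom Partitioner that implements the data partitioning mechanism
 * 1. A custom Partitioner takes two type parameters: the key and value types output by the map stage,
 *      because partitioning is triggered when the map stage outputs data
 * 2. Override the getPartition method
 */

class WCPartitioner extends Partitioner<Text, LongWritable> {
    /**
     * @param key      key output by the map stage
     * @param value    value output by the map stage
     * @param numPartitions the configured number of ReduceTasks
     * @return         the partition number as an int; partition numbers start at 0 and must be consecutive
     */
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        /*
         * Partition logic: words whose first letter is upper case go to partition 0,
         * all other words go to partition 1.
         * Partition numbers start at 0.
         */
        String word = key.toString();
        // An empty key has no first letter; treat it like the other non-upper-case keys
        if (word.isEmpty()) {
            return 1;
        }
        char first = word.charAt(0);
        if (first >= 'A' && first <= 'Z') {
            return 0;
        } else {
            return 1;
        }
    }
}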

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

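/**
 * Requirement: word count, with these output rules:
 * words whose first letter is h or H are written to partition 0
 * words whose first letter is s or S are written to partition 1
 * all other words are written to partition 2
 */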
public class WCDriver03 {
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
        Configuration configuration = new Configuration();

        Job job = Job.getInstance(configuration);
        // Needed when the job runs from a packaged JAR on the cluster
        job.setJarByClass(WCDriver03.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/wordcount.txt"));

        job.setMapperClass(WCMapper03.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setPartitionerClass(WCPartitioner03.class);

        job.setReducerClass(WCReducer03.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Three partitions (0, 1, 2), so three ReduceTasks
        job.setNumReduceTasks(3);

        Path path = new Path("/output");
        // The URI scheme must be followed by "//"
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}

class WCMapper03 extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // Skip empty tokens (blank lines, repeated spaces) so the partitioner never receives an empty key
            if (word.isEmpty()) {
                continue;
            }
            context.write(new Text(word), new LongWritable(1L));
        }
    }
}

class WCReducer03 extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // Sum the counts of one word
        long sum = 0L;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

class WCPartitioner03 extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        String word = key.toString();
        // An empty key has no first letter; it falls into the "other" partition
        if (word.isEmpty()) {
            return 2;
        }
        // Lower-case the word first so that the check ignores case.
        // This is equivalent to comparing the first character against the codes 72/104 (H/h) and 83/115 (S/s).
        char first = word.toLowerCase().charAt(0);
        if (first == 'h') {
            return 0;
        } else if (first == 's') {
            return 1;
        } else {
            return 2;
        }
    }
}

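The ring buffer of MapTask output

After the partition of a kv pair has been computed, the MR program uses the collect method of the collector to write the kv data and the partition number into the ring buffer. The ring buffer is an in-memory structure; in the source code it is a byte array, byte[] kvbuffer.

The ring buffer is only 100M by default, and it has a threshold that defaults to 80%. When the data written to the buffer exceeds the threshold, the data already in the buffer is spilled to a disk file (spillN.out). While the spill is in progress, the remaining 20% of the ring buffer keeps accepting the output the MapTask continues to produce, written in the opposite direction.

The ring buffer size and threshold can be configured:
mapreduce.task.io.sort.mb    100M    size of the ring buffer
mapreduce.map.sort.spill.percent     0.8  spill threshold of the ring buffer

[Optimization] To make an MR program run faster, reduce the number of spills to disk. The buffer size and threshold are therefore usually tuned to the workload of each program.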
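To get a feel for how the two settings influence the number of spills, here is a rough back-of-the-envelope sketch. It is an assumption-laden estimate with hypothetical names: it ignores the metadata stored next to each record and simply divides the map output volume by the spill threshold.

```java
// Rough estimate only; the real buffer also stores per-record metadata, which this sketch ignores.
final class SpillEstimateSketch {
    // Bytes that may accumulate before a spill starts: buffer size x threshold
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        return (long) (sortMb * 1024L * 1024L * spillPercent);
    }

    // Approximate number of spill files for a given volume of map output (ceiling division)
    static long estimatedSpills(long mapOutputBytes, int sortMb, double spillPercent) {
        long threshold = spillThresholdBytes(sortMb, spillPercent);
        return (mapOutputBytes + threshold - 1) / threshold;
    }

    public static void main(String[] args) {
        long output = 400L * 1024 * 1024; // 400M of map output, a made-up figure
        System.out.println(estimatedSpills(output, 100, 0.8)); // default buffer
        System.out.println(estimatedSpills(output, 200, 0.8)); // doubled buffer
    }
}
```

Under these assumptions, doubling the buffer from 100M to 200M lowers the estimate for 400M of output from 5 spills to 3.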

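In the Shuffle stage, custom sorting rules are used to ensure that the output results are in order.

The Shuffle stage sorts the data three times in total, which is why the data in the final result files is ordered. The three sorts happen at these points:
1. When the ring buffer exceeds its threshold and is spilled to disk, the data is first sorted in the ring buffer. The sort uses the key's comparator and the quicksort algorithm.
2. When the map stage has produced several spill files, merging them together with the data remaining in the buffer triggers the second sort. It also uses the key's comparator, with the merge sort algorithm.
3. After a ReduceTask has pulled the data of its partition to its own node, the data from the different MapTasks is merge sorted once more. By default this sort also uses the key's comparator, but the reduce side is special in that a separate ordering rule can be specified for it.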
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Objects;
import java.util.Random;

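/**
 * The upstream traffic, downstream traffic and total traffic of each phone number have already been computed.
 * Building on that result file, sort all records by total traffic from high to low.
 *
 * Input file format: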
 * 13480253104	180	180	360
 * 13502468823	7335	110349	117684
 * 13560436666	2481	24681	27162
 * 13560436666	3597	25635	2070
 * 13560439658	918	4938	5856
 * 13560439658	2034	5892	2070
 * 13602846565	1938	2910	4848
 * 13660577991	6960	690	7650
 * 13719199419	240	0	240
 * 13726230503	2481	24681	27162
 * 13760778710	120	120	240
 * 13826544101	264	0	264
 * 13922314466	3008	3720	6728
 * 13925057413	11058	48243	59301
 * 13926251106	240	0	240
 * 13926435656	132	1512	1644
 * 15013685858	3659	3538	7197
 * 15920133257	3156	2936	6092
 * 15989002119	1938	180	2118
 * 18211575961	1527	2106	3633
 * 18320173382	9531	2412	11943
 * 84138413	4116	1432	5548
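 * Implementation: MapReduce sorts by the map output key, so it is enough to pass the data as the map output key
 *      and to define the ordering of that key.
 *      1. The map stage reads the file, wraps each record in a JavaBean, and outputs the JavaBean as key with null as value
 */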
public class FlowDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        Configuration configuration = new Configuration();

        Job job = Job.getInstance(configuration);
        job.setJarByClass(FlowDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        // The input path is the result file of the earlier traffic statistics job
        FileInputFormat.setInputPaths(job, new Path("/output/part-r-00000"));

        job.setMapperClass(FlowMapper.class);
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setReducerClass(FlowReducer.class);
        job.setOutputKeyClass(FlowBean.class);
        job.setOutputValueClass(NullWritable.class);
        // A single ReduceTask gives one globally sorted result file
        job.setNumReduceTasks(1);

        Path path = new Path("/output1");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)) {
            // Use the recursive overload; delete(Path) is deprecated
            fs.delete(path, true);
        }

        FileOutputFormat.setOutputPath(job, path);

        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}

/**
 * Map logic:
 *      Read the traffic statistics file, split each line on \t to get the fields, and wrap them in a FlowBean.
 *      Output the FlowBean as key and null as value to the reduce stage.
 *      The data produced by the MR program is then guaranteed to be ordered.
 */
class FlowMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean, NullWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] array = line.split("\t");
        String phoneNumber = array[0];
        Long upFlow = Long.parseLong(array[1]);
        Long downFlow = Long.parseLong(array[2]);
        Long sumFlow = Long.parseLong(array[3]);
        FlowBean flowBean = new FlowBean(phoneNumber, upFlow, downFlow, sumFlow);
        context.write(flowBean, NullWritable.get());
    }
}
class FlowReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {
    @Override
    protected void reduce(FlowBean key, Iterable<NullWritable> values, Reducer<FlowBean, NullWritable, FlowBean, NullWritable>.Context context) throws IOException, InterruptedException {
        // The key is written once per group of keys that compare as equal
        context.write(key, NullWritable.get());
    }
}


/**
 * The JavaBean is transferred as the key of the map stage, so it must be serializable and comparable.
 * Because a specific ordering is required, the comparison method must implement exactly that ordering.
 */
class FlowBean implements WritableComparable<FlowBean> {
    private String phoneNumber;
    private Long upFlow;
    private Long downFlow;
    private Long sumFlow;

    // A no-argument constructor is required: Hadoop creates the bean by reflection when deserializing
    public FlowBean() {
    }

    public FlowBean(String phoneNumber, Long upFlow, Long downFlow, Long sumFlow) {
        this.phoneNumber = phoneNumber;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public String getPhoneNumber() {
        return phoneNumber;
    }

    public void setPhoneNumber(String phoneNumber) {
        this.phoneNumber = phoneNumber;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        FlowBean flowBean = (FlowBean) o;
        return Objects.equals(phoneNumber, flowBean.phoneNumber) && Objects.equals(upFlow, flowBean.upFlow) && Objects.equals(downFlow, flowBean.downFlow) && Objects.equals(sumFlow, flowBean.sumFlow);
    }

    @Override
    public int hashCode() {
        return Objects.hash(phoneNumber, upFlow, downFlow, sumFlow);
    }

    @Override
    public String toString() {
        return phoneNumber + "\t" + upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    /**
     * Sorts by total traffic in descending order:
     *      returns -1 if this total is larger than the other, 1 if it is smaller.
     * @param o the object to be compared.
     * @return -1, 1 or 0 (equal totals)
     */
    @Override
    public int compareTo(FlowBean o) {
        if (this.sumFlow > o.sumFlow) {
            return -1;
        } else if (this.sumFlow < o.sumFlow) {
            return 1;
        } else {
            return 0;
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization: write the fields
        out.writeUTF(phoneNumber);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization: read the fields in the same order they were written
        phoneNumber = in.readUTF();
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }
}


Which keys count as the same key when reduce aggregates data after the shuffle phase?

Besides the hashCode and equals methods of the custom key type, the decision also depends on the key's comparator.

If the comparator returns 0, MapReduce treats the two keys being compared as the same key. To keep records with equal totals apart, the example above can be changed so that the comparator never returns 0:

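    @Override
    public int compareTo(FlowBean02 o) {
        // Never returns 0, so two keys with the same total are never treated as the same key.
        // Note: this variant sorts ascending, and it deviates from the general compareTo contract.
        if (this.sumFlow > o.sumFlow){
            return 1;
        } else {
            return -1;
        }
    }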

Spill sorting of MapTask output

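When the ring buffer spills, its data is first sorted by partition and, within each partition, by the comparator of the key, so that every spill file is ordered partition by partition. For this reason the key output by a MapTask must implement the WritableComparable interface: it has to provide the serialization and deserialization methods as well as the comparison method, which defines the ordering rule. A spill writes all of its data to the file in one pass, and the spilled data is cleared from the ring buffer once the write has finished.

Every spill to disk triggers one sort. By default the spill sort is implemented with quicksort.

A MapTask may produce several spill files while it runs. At the end they are merged into one large file, and this merge requires another sort, which uses merge sort.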
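
The two sorting steps described above can be modeled in plain Java. This is only a sketch under simplifying assumptions, not Hadoop code: `SpillMergeSketch`, `Rec`, `spill`, and `merge` are hypothetical names, records live in memory instead of files, and `List.sort` stands in for the quicksort Hadoop uses on the buffer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model of spill sorting and merging; this is not Hadoop code. */
final class SpillMergeSketch {

    /** One buffered map output record: its target partition and its key. */
    static final class Rec {
        final int partition;
        final String key;
        Rec(int partition, String key) { this.partition = partition; this.key = key; }
        @Override public String toString() { return partition + ":" + key; }
    }

    /** Order inside every spill file: by partition first, then by key. */
    static final Comparator<Rec> SPILL_ORDER =
            Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key);

    /** Sorts one buffer's worth of records, as happens before each spill. */
    static List<Rec> spill(List<Rec> buffer) {
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort(SPILL_ORDER); // stand-in for the quicksort used on the ring buffer
        return sorted;
    }

    /** K-way merge of already sorted spills into one sorted run. */
    static List<Rec> merge(List<List<Rec>> spills) {
        // Each queue entry is {spillIndex, positionInSpill}, ordered by the record it points at.
        Comparator<int[]> byHead = (a, b) -> SPILL_ORDER.compare(
                spills.get(a[0]).get(a[1]), spills.get(b[0]).get(b[1]));
        PriorityQueue<int[]> heads = new PriorityQueue<>(byHead);
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<Rec> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<Rec> spill = spills.get(head[0]);
            merged.add(spill.get(head[1]));
            if (head[1] + 1 < spill.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Rec> s1 = spill(Arrays.asList(new Rec(1, "b"), new Rec(0, "z"), new Rec(0, "a")));
        List<Rec> s2 = spill(Arrays.asList(new Rec(0, "m"), new Rec(1, "a")));
        System.out.println(merge(Arrays.asList(s1, s2))); // [0:a, 0:m, 0:z, 1:a, 1:b]
    }
}
```

Because every spill is already sorted, the merge only ever compares the current head of each spill, which is why a merge sort fits this step.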

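[Problem] When overriding the comparison method, pay close attention to its return value. If two different records should both be kept, the method must return a positive or a negative integer for them and never 0: once it returns 0, the two records are treated as the same key and only one of them is kept.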
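
The effect can be reproduced without a cluster. The following is a minimal sketch, assuming a reducer that writes only the key once per group; `ZeroMeansSameKeySketch`, `Flow`, and `reduceKeysOnly` are hypothetical names, and the grouping loop is a simplified model of what the framework does. It also shows an alternative to "never return 0": break ties on a second field, so that 0 is returned only for records that really are the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Models how a comparator result of 0 merges two keys into one reduce group. */
final class ZeroMeansSameKeySketch {

    /** Hypothetical stand-in for FlowBean: a phone number and its total flow. */
    static final class Flow {
        final String phone;
        final long sumFlow;
        Flow(String phone, long sumFlow) { this.phone = phone; this.sumFlow = sumFlow; }
    }

    /** Descending by total flow; returns 0 whenever the totals are equal. */
    static final Comparator<Flow> BY_SUM_DESC = (a, b) -> Long.compare(b.sumFlow, a.sumFlow);

    /** Same primary order, with the phone number as tie-breaker. */
    static final Comparator<Flow> BY_SUM_DESC_THEN_PHONE = BY_SUM_DESC.thenComparing(f -> f.phone);

    /** Sorts the keys, then emits one phone number per group of keys that compare as equal. */
    static List<String> reduceKeysOnly(List<Flow> mapOutput, Comparator<Flow> cmp) {
        List<Flow> sorted = new ArrayList<>(mapOutput);
        sorted.sort(cmp);
        List<String> written = new ArrayList<>();
        Flow previous = null;
        for (Flow f : sorted) {
            // A new group starts only when the comparator says the keys differ.
            if (previous == null || cmp.compare(previous, f) != 0) {
                written.add(f.phone);
            }
            previous = f;
        }
        return written;
    }

    public static void main(String[] args) {
        List<Flow> input = Arrays.asList(
                new Flow("13500000001", 300), new Flow("13500000002", 300), new Flow("13500000003", 100));
        System.out.println(reduceKeysOnly(input, BY_SUM_DESC));            // [13500000001, 13500000003]
        System.out.println(reduceKeysOnly(input, BY_SUM_DESC_THEN_PHONE)); // all three phone numbers
    }
}
```

With the tie-breaker the comparator still satisfies the usual `compareTo` contract, which a method that never returns 0 does not.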

Combiner operation in Shuffle stage (optional component of MR program)

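A Combiner is itself a Reducer. The difference is that a Reducer aggregates the results of all MapTasks, whereas a Combiner performs a local summary of the results of the current MapTask only. Its purpose is to reduce the amount of data transferred from the Map stage to the Reduce stage and so improve the efficiency of the MR program.

Rules for using a Combiner:
	1. In the common case the Reducer class can be used as the Combiner directly.
	2. If you do not want the Reducer to act as the Combiner, you can define a custom Combiner. A custom Combiner must meet the following requirements:
		1. The custom Combiner class must extend Reducer.
		2. The input kv types of the Combiner are the kv types output by the map stage.
		   The output kv types of the Combiner must be the key and value types that the Reducer stage takes as input.
	When the amount of data passed from the Map stage to the Reduce stage is too large, a Combiner can perform a local summary on the map side to reduce the amount of data transferred.

When the Combiner runs:
	1. When the number of spill files of the Map stage reaches the threshold (the property mapreduce.map.combine.minspills, 3 by default), the Combiner is triggered automatically.
	2. After the Map stage has finished and all spill files have been merged, the Combiner is triggered once more.
	3. In some extreme cases the Combiner may not run even though it has been set. If the computing pressure on the map side is too high, the Combiner is skipped and the data goes straight to the Reducer.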
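
As a rough illustration of what "local summary" means for word count, here is a plain-Java sketch (no Hadoop involved; `LocalCombineSketch` and `combine` are hypothetical names). It collapses the `(word, 1)` records of a single map task into one `(word, partialCount)` record per word, which is the data volume reduction the Combiner is meant to achieve.

```java
import java.util.Map;
import java.util.TreeMap;

/** Models the local aggregation a word-count Combiner performs inside one MapTask. */
final class LocalCombineSketch {

    /** Sums the (word, 1) records emitted by one map task into one record per word. */
    static Map<String, Long> combine(String[] wordsFromOneMapTask) {
        Map<String, Long> partial = new TreeMap<>();
        for (String word : wordsFromOneMapTask) {
            partial.merge(word, 1L, Long::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] mapOutput = {"a", "b", "a", "a"};
        // 4 records leave the map function, only 2 have to be shuffled to reduce.
        System.out.println(combine(mapOutput)); // {a=3, b=1}
    }
}
```

The Reducer later adds the partial counts from all map tasks, so the final result is the same with or without the Combiner.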

How to add a custom Combiner

class WCCombiner extends Reducer<Text,LongWritable,Text,LongWritable>{
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // Local sum of the counts that the current MapTask produced for this key
        long sum = 0L;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key,new LongWritable(sum));
    }
}
public class WCDriver {
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
        // ... create and configure the Job object named job as in the other driver examples ...
        job.setCombinerClass(WCCombiner.class);
    }
}


Combiner local aggregation problem when MapTask outputs data

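The Combiner is an optional component of MapReduce. If one is added, it runs after the map output and before the reduce input. How many times it runs and when it runs are not fixed. Both depend heavily on the computing load and resources of the MR program, and a configured Combiner may not run even once.

A Combiner can be understood as local aggregation on the Map side. It reduces the amount of data in the Map-side spill files and the amount of data transferred from Map to Reduce.

Not every MR program can add a Combiner. The precondition is that the Combiner must not affect the execution logic of the MR program. For example, when a MapReduce program computes an average, the averaging logic must not be used as a Combiner.

If we define a custom Combiner, its input and output kv types must both match the output types of the Map stage. A Combiner is really just a Reducer, so when the Combiner and the Reducer share the same logic and the Reducer's input and output types satisfy the Combiner's requirements, the Reducer can serve as the Combiner.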
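
The average case is easy to check with a little arithmetic. The sketch below is plain Java with hypothetical names (`CombinerAverageSketch`, `SumCount`). It compares an average of per-MapTask averages with the true average, and shows one common way around the problem, which is an assumption added here and not something the original program does: let the local step emit a sum and a count, and divide only at the very end.

```java
/** Shows why averaging inside a Combiner changes the result, and a sum/count alternative. */
final class CombinerAverageSketch {

    /** Partial aggregate that can be merged safely: a sum and the number of values behind it. */
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
        SumCount plus(SumCount other) { return new SumCount(sum + other.sum, count + other.count); }
        double average() { return (double) sum / count; }
    }

    static SumCount combine(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return new SumCount(sum, values.length);
    }

    /** What happens if each MapTask averages locally and reduce averages the averages. */
    static double averageOfAverages(long[] mapTask1, long[] mapTask2) {
        return (combine(mapTask1).average() + combine(mapTask2).average()) / 2;
    }

    /** Local step keeps (sum, count); the division happens once, in the final step. */
    static double averageViaSumCount(long[] mapTask1, long[] mapTask2) {
        return combine(mapTask1).plus(combine(mapTask2)).average();
    }

    public static void main(String[] args) {
        long[] mapTask1 = {1, 2, 3};
        long[] mapTask2 = {10};
        System.out.println(averageOfAverages(mapTask1, mapTask2));  // 6.0, wrong
        System.out.println(averageViaSumCount(mapTask1, mapTask2)); // 4.0, the true average of 1, 2, 3, 10
    }
}
```

Sums and counts can be combined in any grouping without changing the result, while averages cannot, which is the property a Combiner needs.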

The problem of group sorting of data pulled by ReduceTask

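ReduceTasks correspond one-to-one to partitions: one ReduceTask processes the data of one partition. That data may come from several MapTasks, so before computing, a ReduceTask first goes through a copy phase, which pulls the data of its partition from every MapTask to the node where the ReduceTask runs. By default the data is pulled into the ReduceTask's memory. If it does not fit in memory, it is spilled to disk.

Because the amount of data a ReduceTask pulls can be large, the data is also merged while it is being pulled.

After the data has been pulled and merged, the ReduceTask sorts and groups it. Sorting: once the merges are done, one more merge sort runs over all of the pulled data. Grouping: the data of the partition is divided into groups by key, and the reduce method is called once for each group of identical keys.

Sorting uses the comparator of the key by default. Grouping relies by default not only on the key's equality methods (hashCode and equals) but also on the comparator. If hashCode and equals say two keys are equal but the comparator says the two objects differ, the MR program treats the keys as different and puts them into different groups.
In an MR program the comparator is the main means of deciding equality.

By default, reduce groups keys using the comparator of the map output key to decide equality. In some cases, however, reduce should not group by that comparator, and we have to define a separate grouping sort for the Reduce stage. Its purpose is to tell reduce how to group the keys.

To customize grouping on the reduce side, it is enough to extend one class, WritableComparator: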
/**
 * The core of how Reduce decides whether two keys are equal
 */
class FlowGroupComparator extends WritableComparator {
    /**
     * Constructor of the grouping comparator
     */
    protected FlowGroupComparator(){
        // The keys being grouped are of type FlowBean02
        super(FlowBean02.class,true);
    }
    /**
     * Core logic Reduce uses to decide whether keys are equal:
     * two FlowBean02 keys are equal when their phone numbers are equal
     * @param a     a key output by the map stage
     * @param b     another key output by the map stage
     * @return 0 if both keys belong to the same group, otherwise a non-zero value
     */
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        FlowBean02 f1 = (FlowBean02) a;
        FlowBean02 f2 = (FlowBean02) b;
        if (f1.getPhoneNumber().equals(f2.getPhoneNumber())){
            return 0;
        }else {
            return 1;
        }
    }
}
public class FlowDriver02 {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        // ... create and configure the Job object named job as in the other driver examples ...
        job.setGroupingComparatorClass(FlowGroupComparator.class);
    }
}

Case implementation of group-assisted sorting of data pulled by ReduceTask

package com.kang.group;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;


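/**
 * Example of auxiliary sorting (grouping sort) in an MR program:
 *      The auxiliary sort runs after reduce has pulled its data. It lets reduce decide which keys count as the same key. Without it, the MR program uses the sort rule of the map output key as the condition for key equality.
 *
 *  Suppose there is an order file in the following format:
 *  order id	product id	transaction amount
 * 0000001	Pdt_01	222.8
 * 0000001	Pdt_05	25.8
 * 0000002	Pdt_03	522.8
 * 0000002	Pdt_04	122.4
 * 0000002	Pdt_05	722.4
 * 0000003	Pdt_01	222.8
 * 0000003	Pdt_02	33.8
 * The file has three columns separated by \t. Based on this file we need to find, for each order, the product with the largest transaction amount. The expected result is:
 * 0000001	Pdt_01	222.8
 * 0000002	Pdt_05	722.4
 * 0000003	Pdt_01	222.8
 *
 * Case analysis: first sort the order data by order id in ascending order and, for equal order ids, by transaction amount in descending order.
 * Then it is enough to group by order id and take the first record of each group, which is the product with the largest transaction amount in that order.
 *
 * To obtain the record with the largest amount for each order, the only change needed on top of the sorting code is to specify the grouping rule again for the reduce-side aggregation.
 * Records belong to the same group whenever their order ids are equal. With equal order ids, only the first of the records reaches reduce as the key.
 */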
public class OrderDriver {
    
    
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
    
    
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://192.168.31.104:9000");

        Job job = Job.getInstance(configuration);
        job.setJarByClass(OrderDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job,new Path("/group.txt"));

        job.setMapperClass(OrderMapper.class);
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setGroupingComparatorClass(OrderGroupComparator.class);

        job.setReducerClass(OrderReducer.class);
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);

        Path path = new Path("/orderOutput");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)){
    
    
            fs.delete(path,true);
        }
        FileOutputFormat.setOutputPath(job,path);

        boolean flag = job.waitForCompletion(true);
        System.exit(flag?0:1);
    }
}

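/**
 * The map stage reads each line of order data, splits it on \t, and wraps the fields in an OrderBean object.
 * The OrderBean is emitted as the key with a null value.
 * During the computation the MR program then sorts the data automatically by the rule defined in OrderBean.
 */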
class OrderMapper extends Mapper<LongWritable, Text,OrderBean, NullWritable>{
    
    
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
    
    
        String line = value.toString();
        String[] message = line.split("\t");
        String orderId = message[0];
        String pId = message[1];
        Double amount = Double.parseDouble(message[2]);
        OrderBean orderBean = new OrderBean(orderId,pId,amount);
        context.write(orderBean,NullWritable.get());
    }
}

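/**
 * Reducer stage: reduce only needs to write out the key-value data it receives, because sorting was already done automatically in the shuffle between map and reduce.
 * So when only sorting is needed, the reduce stage writes the processed data out unchanged.
 */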
class OrderReducer extends Reducer<OrderBean,NullWritable,OrderBean,NullWritable>{
    
    
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
    
    
        context.write(key,NullWritable.get());
    }
}
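/**
 * Define the auxiliary sort and redefine the key grouping logic of reduce:
 *  if the orderId is the same, two records are treated as the same key, and reduce uses the first record as the key of the group
 */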
class OrderGroupComparator extends WritableComparator{
    
    
    public OrderGroupComparator(){
    
    
        super(OrderBean.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
    
    
        OrderBean o1 = (OrderBean) a;
        OrderBean o2 = (OrderBean) b;
        return o1.getOrderId().compareTo(o2.getOrderId());
    }
}
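/**
 * The data must be sorted, and the sort rule involves several fields. In an MR program only the key output by the map stage can be sorted,
 * so all sort fields have to be passed as the map output key. Only one key can be passed, so the fields are wrapped in one JavaBean.
 * 1. Define a JavaBean class that wraps the raw data
 */
class OrderBean implements WritableComparable<OrderBean>{

    private String orderId; // order id
    private String pId;     // product id
    private Double amount;  // transaction amount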

    public OrderBean() {
    
    
    }

    public OrderBean(String orderId, String pId, Double amount) {
    
    
        this.orderId = orderId;
        this.pId = pId;
        this.amount = amount;
    }

    public String getOrderId() {
    
    
        return orderId;
    }

    public void setOrderId(String orderId) {
    
    
        this.orderId = orderId;
    }

    public String getpId() {
    
    
        return pId;
    }

    public void setpId(String pId) {
    
    
        this.pId = pId;
    }

    public Double getAmount() {
    
    
        return amount;
    }

    public void setAmount(Double amount) {
    
    
        this.amount = amount;
    }

    @Override
    public String toString() {
    
    
        return orderId + "\t" + pId + "\t" + amount;
    }

    /**
     * Comparator: sort by order id ascending; for equal order ids, sort by transaction amount descending
     * @param o the object to be compared.
     * @return a negative value, zero, or a positive value
     */
    @Override
    public int compareTo(OrderBean o) {
        if (this.orderId.compareTo(o.orderId) == 0){
            // Same order: compare the transaction amounts
            if (this.amount > o.amount){
                return -1;
            } else if (this.amount < o.amount) {
                return 1;
            } else {
                return 0;
            }
        }else {
            return this.orderId.compareTo(o.orderId);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
    
    
        out.writeUTF(orderId);
        out.writeUTF(pId);
        out.writeDouble(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
    
    
        orderId = in.readUTF();
        pId = in.readUTF();
        amount = in.readDouble();
    }
}

Result display (a screenshot in the original post): the first set of data is the output without the grouping auxiliary sort, the second set is the original data of the file, and the third set is the output with the grouping auxiliary sort.
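
To see why the grouping comparator yields one record per order, the sort-then-group steps can be replayed in plain Java on the sample data. This is a simplified model, not the Hadoop execution path: `GroupingSketch`, `Order`, `sampleOrders`, and `firstOfEachGroup` are hypothetical names, and the loop imitates a reducer that writes only the key it sees at the start of each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Replays "sort by (orderId asc, amount desc), group by orderId, keep the first key". */
final class GroupingSketch {

    static final class Order {
        final String orderId;
        final String pId;
        final double amount;
        Order(String orderId, String pId, double amount) {
            this.orderId = orderId;
            this.pId = pId;
            this.amount = amount;
        }
        @Override public String toString() { return orderId + "\t" + pId + "\t" + amount; }
    }

    /** Sort rule of the map output key, like OrderBean.compareTo. */
    static final Comparator<Order> SORT = Comparator.comparing((Order o) -> o.orderId)
            .thenComparing((a, b) -> Double.compare(b.amount, a.amount));

    /** Grouping rule, like OrderGroupComparator: only the order id matters. */
    static final Comparator<Order> GROUP = Comparator.comparing((Order o) -> o.orderId);

    static List<Order> sampleOrders() {
        return Arrays.asList(
                new Order("0000001", "Pdt_01", 222.8), new Order("0000001", "Pdt_05", 25.8),
                new Order("0000002", "Pdt_03", 522.8), new Order("0000002", "Pdt_04", 122.4),
                new Order("0000002", "Pdt_05", 722.4), new Order("0000003", "Pdt_01", 222.8),
                new Order("0000003", "Pdt_02", 33.8));
    }

    /** One output line per group: the first key of the group after sorting. */
    static List<String> firstOfEachGroup(List<Order> mapOutput) {
        List<Order> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<String> output = new ArrayList<>();
        Order previous = null;
        for (Order o : sorted) {
            if (previous == null || GROUP.compare(previous, o) != 0) {
                output.add(o.toString()); // a new group starts here
            }
            previous = o;
        }
        return output;
    }

    public static void main(String[] args) {
        firstOfEachGroup(sampleOrders()).forEach(System.out::println);
    }
}
```

If `GROUP` is replaced by `SORT` in the loop, every record forms its own group and all seven lines are written, which corresponds to running the job without the grouping comparator.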

6. ReduceTask mechanism in MR program running

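After a ReduceTask has grouped its data, it calls the reduce method once for each group of identical keys. The reduce method then aggregates the data and performs the computation.

First point: setting the number of ReduceTasks
	In an MR program the number of MapTasks is determined automatically from the number of input splits and cannot be set by hand. To change it, we can only do so indirectly by changing how the input is split.
	The number of ReduceTasks works differently: it can be specified manually. So how many ReduceTasks are appropriate? By default the number of ReduceTasks is required to match the number of partitions, because one Reduce task processes the data of one partition.
	[Note] MR has one special case: the number of ReduceTasks can be set to 0. The MR program then has only the Mapper stage and no reduce stage, and the output of the map stage is the final output of the whole program.
When writing an MR program, if the operation does not involve an aggregation over the whole data set (that is, the result does not have to be derived from the data set as a whole), we do not recommend adding a Reduce stage, because a Reduce stage makes the MR program run much less efficiently.

Second point (very important): data skew, which can occur in both the Map stage and the Reduce stage
	Data skew is not part of how MR works. It is a situation that may arise while an MR program runs and that hurts its efficiency. Data skew means the amounts of partition data handled by different ReduceTasks differ too much, so some ReduceTasks finish quickly while others run for a very long time.
	If most reduceTasks finish quickly while a few run far too long, data skew is usually the cause. The solutions are straightforward:
		1. Customize the partitioning so that the data is distributed across the partitions as evenly as possible.
		2. Eliminate hot keys. Partitions are often uneven because some keys simply occur too many times. To eliminate them, append a random number to the hot key while processing the data.
		3. Sample and analyze the data (mathematical algorithms and ideas).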
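
The hot-key remedy can be illustrated with a small plain-Java model. The `partition` method uses the same formula as Hadoop's default HashPartitioner, applied here to `String.hashCode()`. Everything else is an assumption made for the sketch: the class and method names are hypothetical, and the salt is assigned round-robin instead of randomly so that the output is reproducible.

```java
/** Models how salting a hot key spreads its records over several partitions. */
final class SaltingSketch {

    /** Same formula as the default hash partitioning: non-negative hash modulo reducer count. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Appends a salt in [0, salts) to the key; round-robin here, random in practice. */
    static String salt(String key, long recordIndex, int salts) {
        return key + "#" + (recordIndex % salts);
    }

    /** Counts how many of the hot key's records land in each partition. */
    static int[] load(int records, int salts, int numReduceTasks) {
        int[] perPartition = new int[numReduceTasks];
        for (int i = 0; i < records; i++) {
            String key = salts > 0 ? salt("hot", i, salts) : "hot";
            perPartition[partition(key, numReduceTasks)]++;
        }
        return perPartition;
    }

    static int max(int[] values) {
        int max = 0;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(max(load(1000, 0, 4))); // 1000: one reducer receives every record
        System.out.println(max(load(1000, 4, 4))); // 250: the records are spread evenly
    }
}
```

Salting changes the key, so the partial results for `hot#0` to `hot#3` still have to be merged afterwards, typically by stripping the salt and aggregating once more in a second step.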

Running case without reduce stage

package com.kang.noreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

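/**
 * There is a file with many lines. Each line consists of several words separated by spaces.
 * The MR program must filter out every English word that starts with an uppercase letter and write a result file
 * that contains only the words starting with a lowercase letter.
 */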
public class DemoDriver {
    
    
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
    
    
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://192.168.31.104:9000");
        Job job = Job.getInstance(configuration);
        job.setJarByClass(DemoDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job,"/wordcount.txt");
        /**
         * This MR program has only a map stage and no reduce stage,
         * so the output of the map stage is the final result.
         */
        job.setMapperClass(DemoMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
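        /**
         * If an MR program has no reduce stage, be sure to set the number of reduce tasks to 0.
         * If it is not set to 0 and no reducer class is specified, the MR program adds a default reducer
         * and starts one reduceTask.
         * The default reducer is trivial: it writes out exactly what the map stage emitted.
         */
        job.setNumReduceTasks(0);
        Path path = new Path("/output2");
        // The URI needs "//" after the scheme, matching the fs.defaultFS value above
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)){
            fs.delete(path, true); // recursive delete; the single-argument delete is deprecated
        }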
        FileOutputFormat.setOutputPath(job,path);
        boolean flag = job.waitForCompletion(true);
        System.exit(flag?0:1);
    }
}

class DemoMapper extends Mapper<LongWritable, Text,NullWritable, Text>{
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, NullWritable, Text>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // Skip empty tokens (consecutive spaces), then keep only words that start with a-z
            if (!word.isEmpty() && word.charAt(0) >= 'a' && word.charAt(0) <= 'z'){
                context.write(NullWritable.get(),new Text(word));
            }
            // Filtering: a word that is never passed to context.write is discarded
        }
    }
}


7. OutputFormat component in MR program running

OutputFormat is the output formatting class of an MR program. The abstract class declares the method getRecordWriter, which is its core function: it defines how, and by what rules, the kv data is written to files when the Reduce stage (or the Map stage in a map-only job) outputs the final result.

Common implementation classes of OutputFormat

  • TextOutputFormat: is the default implementation class of OutputFormat
    • The output is plain text. Each key-value pair output by reduce is written on its own line, with key and value separated by \t.
    • The data is written to files in the specified output directory. The default file naming rule is part-m-xxxxx for map output and part-r-xxxxx for reduce output.
  • SequenceFileOutputFormat
    • A special output format in Hadoop that writes SequenceFile files.
    • SequenceFile file
      • A SequenceFile is a special binary file format provided by Hadoop. It supports compressing the data before writing it to the file, which can noticeably improve the efficiency of distributed computing. Data in a SequenceFile is stored as key-value pairs in binary form, optionally compressed.
      • A SequenceFile can store data compressed or uncompressed. Overall there are three compression types:
        • none: the data is not compressed
        • record: only the value of each key-value pair is compressed
        • block: multiple key-value pairs are compressed together

Custom OutputFormat

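By default, MapReduce calls the TextOutputFormat class to write the output of an MR program to files. Each key-value pair is written with key and value separated by \t, and the output files are named part-m/r-xxxxx.

Besides TextOutputFormat there is another implementation class, SequenceFileOutputFormat. It writes the key-value data in binary form, the binary output supports compression, and the output files are also named part-m/r-xxxxx.
With both implementation classes, one reduceTask writes exactly one file by default.

In some cases these two implementation classes cannot meet our output requirements, so we have to define a custom OutputFormat:
	1. Define a class that extends FileOutputFormat.
	2. Override the getRecordWriter method, which must return an object of a RecordWriter subclass.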

Case analysis

package com.kang.customoutput;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

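/**
 * Word count example in which the whole MR program may have only one partition and one reduceTask.
 * The counts of words starting with an uppercase letter must be written to upper.txt,
 * and the counts of words starting with a lowercase letter to lower.txt.
 *
 * In this case one reduceTask writes two files, and neither of them is named part-r/m-xxxxx.
 */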
public class WCDriver {
    
    
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
    
    
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://192.168.31.104:9000");

        Job job = Job.getInstance(configuration);
        job.setJarByClass(WCDriver.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job,"/wordcount.txt");

        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class); // value type of the map output

        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);

        Path path = new Path("/wordcountOutput");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)){
            fs.delete(path, true); // recursive delete of the old output directory
        }
        job.setOutputFormatClass(WCOutputFormat.class);
        FileOutputFormat.setOutputPath(job,path);

        boolean flag = job.waitForCompletion(true);
        System.exit(flag?0:1);
    }
}

class WCMapper extends Mapper<LongWritable, Text,Text,LongWritable>{
    
    
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
    
    
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word),new LongWritable(1L));
        }
    }
}

class WCReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
    
    
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
    
    
        long sum = 0L;
        for (LongWritable value : values) {
    
    
            sum += value.get();
        }
        context.write(key,new LongWritable(sum));
    }
}
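/**
 * A custom OutputFormat class that implements the output rule:
 *      words starting with an uppercase letter are written to upper.txt
 *      words starting with a lowercase letter are written to lower.txt
 */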
class WCOutputFormat extends FileOutputFormat<Text,LongWritable>{
    
    
    /**
     * Core logic that decides how the data is written out
     * @param context the context object of the running MR program; all Configuration settings of the job can be obtained through it
     * @return the RecordWriter that writes the key-value pairs
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public RecordWriter<Text, LongWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        /**
         * Create the RecordWriter implementation as a local class and return an instance of it
         */
        class WCRecordWriter extends RecordWriter<Text,LongWritable>{
            private FSDataOutputStream stream1; // output stream connected to upper.txt on HDFS
            private FSDataOutputStream stream2; // output stream connected to lower.txt on HDFS
            public WCRecordWriter() throws URISyntaxException, IOException, InterruptedException {
                this(context);
            }
            public WCRecordWriter(TaskAttemptContext context) throws URISyntaxException, IOException, InterruptedException {
                // Get the configuration object of the MR program through the context
                Configuration configuration = context.getConfiguration();
                String hdfsAddress = configuration.get("fs.defaultFS"); // HDFS address taken from the configuration
                FileSystem fs = FileSystem.get(new URI(hdfsAddress), configuration, "root"); // connect to HDFS
                /**
                 * Create the output streams to HDFS. Both files go into the output directory specified in the Driver,
                 * so the directory should be read from the configuration and not be hard-coded here.
                 */
                String outputDir = configuration.get(FileOutputFormat.OUTDIR); // output directory set in the Driver
                stream1 = fs.create(new Path(outputDir + "/upper.txt"));
                stream2 = fs.create(new Path(outputDir + "/lower.txt"));
            }
            /**
             * How the data is written: every record goes to one of the two files
             * @param key the key to write.
             * @param value the value to write.
             * @throws IOException
             * @throws InterruptedException
             */
            @Override
            public void write(Text key, LongWritable value) throws IOException, InterruptedException {
                String word = key.toString();
                String line = word + "=" + value.get() + "\n";
                // Words whose first character is A-Z go to upper.txt, everything else to lower.txt
                if (!word.isEmpty() && word.charAt(0) >= 'A' && word.charAt(0) <= 'Z'){
                    stream1.write(line.getBytes(java.nio.charset.StandardCharsets.UTF_8));
                    stream1.flush();
                }else {
                    stream2.write(line.getBytes(java.nio.charset.StandardCharsets.UTF_8));
                    stream2.flush();
                }
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException, InterruptedException {
                stream1.close();
                stream2.close();
            }
        }
        try {
    
    
            return new WCRecordWriter(context);
        } catch (URISyntaxException e) {
    
    
            throw new RuntimeException(e);
        }
    }
}


Case study: write the results to the database

package com.kang.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class WCDriver {
    
    
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException, ClassNotFoundException {
    
    
        // 1. Prepare a configuration object
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS","hdfs://192.168.31.104:9000");

        // 2. Create the Job object that wraps the MR program
        Job job = Job.getInstance(configuration);
        job.setJarByClass(WCDriver.class);

        // Specify the input path. The path is local by default; to read from HDFS, fs.defaultFS must point to the HDFS address
        FileInputFormat.setInputPaths(job,new Path("/wordcount.txt"));

        /**
         * 4. Configure the Mapper stage
         */
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        /**
         * 6. Configure the Reducer stage
         */
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);

        // Configure the output path. It must not exist beforehand, so check for it first and delete it if it does
        Path path = new Path("/output");
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fileSystem.exists(path)){
            fileSystem.delete(path,true);
        }
        job.setOutputFormatClass(WCOutputFormat.class);
        FileOutputFormat.setOutputPath(job,path);

        /**
         * 8. Submit the program
         *      On submission the input splits are planned first, then the configuration and code are handed to the resource scheduler
         */
        boolean b = job.waitForCompletion(true);
        System.exit(b?0:1);
    }
}
class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
    
    
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
    
    
        String line = value.toString();
        System.out.println("key read by map through InputFormat: " + key.get() + ", value: " + line);
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word),new LongWritable(1L));
        }
    }
}

class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
    
    
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
    
    
        Long sum = 0L;
        for (LongWritable value : values) {
    
    
            sum += value.get();
        }
        context.write(key,new LongWritable(sum));
    }
}
class WCOutputFormat extends FileOutputFormat<Text,LongWritable>{
    
    

    @Override
    public RecordWriter<Text, LongWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
    
    
        return new WCRecordWriter();
    }
}
class WCRecordWriter extends RecordWriter<Text,LongWritable>{
    
    
    private Connection connection;
    private PreparedStatement preparedStatement;
    public WCRecordWriter(){
        // Connect to MySQL in the no-argument constructor
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mr?serverTimezone=UTC&useUnicode=true&characterEncoding=UTF-8","root","root");
            String sql = "insert into wordcount(word,count) values(?,?)";
            preparedStatement = connection.prepareStatement(sql);
        } catch (ClassNotFoundException | SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void write(Text key, LongWritable value) throws IOException, InterruptedException {
    
    
        String word = key.toString();
        Long count = value.get();
        try {
    
    
            preparedStatement.setString(1,word);
            preparedStatement.setInt(2,count.intValue());
            preparedStatement.executeUpdate();
        } catch (SQLException e) {
    
    
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    
    
        if (preparedStatement != null){
    
    
            try {
    
    
                preparedStatement.close();
            } catch (SQLException e) {
    
    
                throw new RuntimeException(e);
            }
        }
        if (connection != null){
    
    
            try {
    
    
                connection.close();
            } catch (SQLException e) {
    
    
                throw new RuntimeException(e);
            }
        }
    }
}


8. Support and processing of SequenceFile files in Hadoop

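A SequenceFile is a special file format provided by Hadoop. It stores key-value data in binary form and supports compressing the stored key-value data. It is a commonly used data file format in big data: Spark, Flink, and Hive use the SequenceFile format to store data in many cases.

Because a SequenceFile stores key-value data as binary data, the key or the value can be image, video, or audio data.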

The content in the SequenceFile file is composed of three parts:

  • Header
    • The Header area stores the type of key-value in the file, the compression method used for key and value, the algorithm rules used for compression, and the synchronization identifier.
  • Record area|block area
    • What is stored is the binary data of key and value. If compression is specified, what is stored is the binary compressed data of key-value.
  • Sync markers (sync-mark)

Three compression methods for SequenceFile files:

  • none: Key-value data is not compressed.
  • record: Only the value data in each key-value data is compressed, and the key value is not compressed.
  • block: Compress multiple key-value data, both key and value will be compressed.
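
The practical difference between record and block compression is granularity. The sketch below does not use the SequenceFile API at all. It is a rough plain-Java model with hypothetical names that uses `java.util.zip` to compare compressing every value on its own against compressing the keys and values of many records together.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Rough model of record-level versus block-level compression; not the SequenceFile format. */
final class CompressionGranularitySketch {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Builds n strings of the form prefix0, prefix1, ... as sample keys or values. */
    static String[] sample(String prefix, int n) {
        String[] out = new String[n];
        for (int i = 0; i < n; i++) {
            out[i] = prefix + i;
        }
        return out;
    }

    /** "record" style: each value is compressed on its own, keys are stored uncompressed. */
    static long recordStyleSize(String[] keys, String[] values) {
        long total = 0;
        for (int i = 0; i < keys.length; i++) {
            total += keys[i].getBytes(StandardCharsets.UTF_8).length;
            total += gzip(values[i].getBytes(StandardCharsets.UTF_8)).length;
        }
        return total;
    }

    /** "block" style: the keys of many records are compressed together, and so are the values. */
    static long blockStyleSize(String[] keys, String[] values) {
        ByteArrayOutputStream allKeys = new ByteArrayOutputStream();
        ByteArrayOutputStream allValues = new ByteArrayOutputStream();
        for (int i = 0; i < keys.length; i++) {
            byte[] k = keys[i].getBytes(StandardCharsets.UTF_8);
            byte[] v = values[i].getBytes(StandardCharsets.UTF_8);
            allKeys.write(k, 0, k.length);
            allValues.write(v, 0, v.length);
        }
        return gzip(allKeys.toByteArray()).length + gzip(allValues.toByteArray()).length;
    }

    public static void main(String[] args) {
        String[] keys = sample("key-", 1000);
        String[] values = sample("value-", 1000);
        System.out.println("record style bytes: " + recordStyleSize(keys, values));
        System.out.println("block style bytes:  " + blockStyleSize(keys, values));
    }
}
```

For many small, similar records the block style comes out much smaller, because the compressor can exploit repetition across records and pays its fixed overhead only once per block.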

MapReduce can process SequenceFile files and can also write its results as SequenceFile files, because it provides two classes:

  • SequenceFileInputFormat

    • An implementation class of InputFormat that is dedicated to reading the SequenceFile file format.
    • It reads the key-value pairs in the file one by one. Based on the Header of the SequenceFile, it automatically detects whether the data is compressed and which compression method and algorithm were used. If the data is compressed, it decompresses it with the algorithm named in the Header.
    • The Header of the SequenceFile also specifies the key and value types (these are serialization types), so the InputFormat automatically converts the decompressed binary data into the corresponding key-value data types.
    • [Note] If SequenceFileInputFormat is used, the key-value input types of the map stage are not fixed: they are whatever types the Header of the file declares.
  • SequenceFileOutputFormat

    • An implementation class of OutputFormat that writes the final result of Reduce in the SequenceFile file format, using the compression method we specify.

    • For this class to write data in the SequenceFile file format, the key and value output by the MR program must implement Hadoop's serialization mechanism.

    • job.setOutputFormatClass(SequenceFileOutputFormat.class);
      SequenceFileOutputFormat.setCompressOutput(job,true);
      SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.RECORD);
      SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
      


Origin blog.csdn.net/weixin_57367513/article/details/132718278