Split division analysis in MapReduce (new API)

In interviews, I often like to ask one question: how is the number of maps in Hadoop determined? I have found that many candidates cannot answer it. It is actually a fairly basic question, and working through it helps a lot in understanding how MapReduce works.

Today I finally found time to analyze the source code.
This article is based on Hadoop 2.7.2; the code is available at the link below. —— [github code address]

This article uses the org.apache.hadoop.mapreduce package (i.e., the new API) for the explanation. The split-division process in the org.apache.hadoop.mapred package differs slightly.

1. Determination of the number of maps

The number of maps equals the number of splits. When MapReduce processes a large file, it divides the file into multiple pieces according to certain rules, which improves the parallelism of the map phase.
The pieces it divides out are InputSplits, and each map task handles exactly one InputSplit. Therefore, there are exactly as many maps as there are InputSplits.
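As a quick, hypothetical illustration (not from the article's code), a driver can call getSplits() directly to see how many map tasks an input directory would yield. The path below is the example input used later in this article, and the sketch assumes it is reachable from your Hadoop configuration:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-count");
    FileInputFormat.addInputPath(job, new Path("/tmp/wordcount/input"));
    // getSplits() is the same method the framework calls at job-submit time
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of splits = number of maps = " + splits.size());
  }
}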

2. Who is responsible for dividing the split

Mainly the InputFormat. The InputFormat class plays two important roles:

  • 1) Divide the input data into multiple logical InputSplits, where each InputSplit serves as the input of one map task.
  • 2) Provide a RecordReader that converts the content of an InputSplit into the (k, v) key-value pairs fed to the map function.
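These two roles correspond exactly to the two abstract methods of org.apache.hadoop.mapreduce.InputFormat, abridged below (comments added):

public abstract class InputFormat<K, V> {

  // role 1: logically divide the job's input into InputSplits
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // role 2: provide a RecordReader that parses one InputSplit into (k, v) pairs
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}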

In the new API, InputFormat is an abstract class; its inheritance hierarchy is as follows:

(Figure: inheritance hierarchy of InputFormat and its common subclasses, including FileInputFormat)

As the figure shows, FileInputFormat is the most widely used of these classes. If the input is files on HDFS, a subclass of FileInputFormat is almost always used: for example, TextInputFormat handles plain text files, and SequenceFileInputFormat handles Sequence files.
The getSplits(JobContext job) method of the FileInputFormat class contains the main split-division logic.

InputSplit is likewise an abstract class with several different implementations. Note that different InputFormats also produce different kinds of InputSplit, as shown below:

(Figure: inheritance hierarchy of InputSplit and its implementations, including FileSplit)

FileInputFormat produces FileSplits. The most important parts of FileSplit are the following:

public class FileSplit extends InputSplit implements Writable {
  private Path file;   // path of the HDFS file
  private long start;  // start offset of this split within the file
  private long length; // length of this split
  // ... remaining fields (hosts, etc.) and methods omitted
}
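Inside a map task, the split being processed can be inspected through the context. A minimal mapper sketch (hypothetical class name; it assumes a FileInputFormat-based job, so the cast to FileSplit is safe):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) {
    // each map task is handed exactly one InputSplit
    FileSplit split = (FileSplit) context.getInputSplit();
    System.out.println("file=" + split.getPath()
        + " start=" + split.getStart()
        + " length=" + split.getLength());
  }
}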

3. How to divide the split

The main division logic is in getSplits(JobContext job) of the FileInputFormat class.

3.1. Calculating splitSize

When splits are divided, large files are cut according to splitSize.
The Java code involved is as follows:

   // computed once in getSplits(JobContext job):
   long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
   long maxSize = getMaxSplitSize(job);

   // computed per file, since block size can differ between files:
   long blockSize = file.getBlockSize();
   long splitSize = computeSplitSize(blockSize, minSize, maxSize);

   protected long computeSplitSize(long blockSize, long minSize,
                                   long maxSize) {
     return Math.max(minSize, Math.min(maxSize, blockSize));
   }
  • minSize: the lower bound on each split's size; defaults to 1. getFormatMinSplitSize() is hard-coded to return 1 and cannot change unless the Hadoop source itself is modified. getMinSplitSize(job) reads the parameter mapreduce.input.fileinputformat.split.minsize and returns 1 if it is unset. So minSize defaults to 1.

  • maxSize: the upper bound on each split's size. If mapreduce.input.fileinputformat.split.maxsize is set, maxSize is that value; otherwise it is Long.MAX_VALUE. (A driver-side sketch for setting these two parameters follows this list.)

  • blockSize: the HDFS block size of the file. Note: this value is not necessarily the same everywhere; different files on HDFS can have different block sizes. Therefore, when dividing a file into splits, the blockSize must be obtained per file.

  • splitSize: by the formula above, this defaults to blockSize.
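A small driver-side sketch of setting these bounds, using the value from this article's example (hypothetical class name; assumes job is the Job being configured):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
  static void configure(Job job) {
    // cap splitSize at ~100M, as in this article's example
    FileInputFormat.setMaxInputSplitSize(job, 104857440L);
    // equivalent: set the underlying parameter directly
    job.getConfiguration().setLong(
        "mapreduce.input.fileinputformat.split.maxsize", 104857440L);
  }
}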

In this example, mapreduce.input.fileinputformat.split.maxsize=104857440 (100M) and the blockSize of all files is 256M, so splitSize = 100M.
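Plugging these values into the computeSplitSize formula confirms the result; a standalone check:

public class SplitSizeCheck {
  public static void main(String[] args) {
    long minSize   = 1L;                  // default lower bound
    long maxSize   = 104857440L;          // mapreduce.input.fileinputformat.split.maxsize
    long blockSize = 256L * 1024 * 1024;  // 256M block = 268435456 bytes
    long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
    System.out.println(splitSize);        // 104857440: maxSize wins, being below blockSize
  }
}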

3.2. Dividing the splits

The logic of division is as follows:

  • 1) Iterate over every file in the input directory.
  • 2) Check the file length. A: if the length is 0 and mapred.split.zero.file.skip=true, no split is generated; if mapred.split.zero.file.skip is false, a single split of length 0 is generated. B: if the length is not 0, go to step 3.
  • 3) Check whether the file is splittable: if yes, go to step 4; if not, the file is not divided and one split is generated whose length equals the file length.
  • 4) Compute splitSize for the current file (100M in this article).
  • 5) Check whether (remaining bytes to split) / splitSize is greater than SPLIT_SLOP (hard-coded to 1.1). If true, cut off one split of length splitSize, subtract splitSize from the remaining bytes, and repeat this step; if false, put all remaining bytes into one final split whose length equals the remaining bytes. The SPLIT_SLOP check mainly avoids producing too many small splits. For example, suppose the input contains 100 files of 109M each and splitSize is 100M: without the check, 200 splits would be generated, 100 of size 100M and 100 of only 9M, i.e. 100 splits that are far too small; with the check, each 109M file becomes a single 109M split, since 109/100 = 1.09 ≤ 1.1. MapReduce prefers processing large chunks, and too many small splits hurt performance. A runnable sketch of this loop follows the list.
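A self-contained sketch of step 5 (not the Hadoop source itself, but mirroring the loop in FileInputFormat.getSplits(), where SPLIT_SLOP is hard-coded to 1.1):

import java.util.ArrayList;
import java.util.List;

public class SplitLoopDemo {
  private static final double SPLIT_SLOP = 1.1;

  // returns (start, length) pairs of the splits for one splittable file
  static List<long[]> divide(long fileLength, long splitSize) {
    List<long[]> splits = new ArrayList<>();
    long bytesRemaining = fileLength;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits.add(new long[]{fileLength - bytesRemaining, splitSize});
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) { // the remainder becomes the final split
      splits.add(new long[]{fileLength - bytesRemaining, bytesRemaining});
    }
    return splits;
  }

  public static void main(String[] args) {
    long M = 1024 * 1024;
    // 109M file, splitSize 100M: 109/100 = 1.09 <= 1.1, so ONE 109M split
    System.out.println(divide(109 * M, 100 * M).size());            // prints 1
    // 246.32M file: 100M + 100M + 46.32M, three splits
    System.out.println(divide((long) (246.32 * M), 100 * M).size()); // prints 3
  }
}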

3.3. Case Analysis

The input files used in this article are as follows:

[test@dw01 ~]$ hadoop fs -dus -h /tmp/wordcount/input/*
hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00000 246.32M
hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00002 106.95M
hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00003 7.09M
hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-0004 0

In this example, mapreduce.input.fileinputformat.split.maxsize=104857440 (100M), mapred.split.zero.file.skip=true, and the blockSize of all files is 256M, so splitSize = 100M.

  • hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00000 is divided into 3 splits (246.32M → 100M + 100M + 46.32M)
  • hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00002 is divided into 1 split (106.95 / 100 = 1.0695 ≤ 1.1)
  • hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-00003 is divided into 1 split (7.09M)
  • hdfs://namenode.test.net:9000/tmp/wordcount/input/part-r-0004 produces no split (length 0 and mapred.split.zero.file.skip=true)

Five splits are generated in total, so this job runs five map tasks.
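These counts can be reproduced with the SplitLoopDemo sketch from section 3.2 (assuming both classes sit in the same package):

public class CaseCheck {
  public static void main(String[] args) {
    long M = 1024 * 1024;
    long splitSize = 100 * M; // ~100M, as in this article
    System.out.println(SplitLoopDemo.divide((long) (246.32 * M), splitSize).size()); // 3
    System.out.println(SplitLoopDemo.divide((long) (106.95 * M), splitSize).size()); // 1
    System.out.println(SplitLoopDemo.divide((long) (7.09 * M),  splitSize).size()); // 1
  }
}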


