Hadoop MapReduce in detail

Today I'm going to chat with you about MapReduce. I've poured my heart into putting this together, so grab a bench and settle in. If I've gotten anything wrong, please correct me.

First, let's understand what MapReduce is. It consists mainly of two stages, Map and Reduce. Users only need to write two functions, map() and reduce(), to complete a simple distributed computation.
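
To make this concrete, here is a minimal sketch of a map() function for the classic word-count example (the class name is just an illustration; a matching reduce() sketch appears under step ⑧ below). It receives one line of text and emits a (word, 1) pair for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count map(): input key = byte offset of the line, input value = the line itself.
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word on the line
        }
    }
}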

Process introduction:

[Figure: MapReduce workflow diagram, steps ① through ⑪]

①②③InputFormat

The InputFormat abstract class determines how an input file is divided into input splits. InputFormat obtains the set of splits for a job and then provides a RecordReader (via createRecordReader) to read the data in each split.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

    // Logically splits the job's input files into a list of InputSplits.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates a RecordReader that reads key/value records from the given split.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}

(1) getSplits(JobContext) is responsible for logically dividing a large input into many splits. Each InputSplit records two things: the start offset of the split and the length of its data. An InputSplit therefore does not store the data itself; it only describes how the data is sliced.
(2) createRecordReader(InputSplit, TaskAttemptContext) returns a RecordReader that can read the records of a split produced by getSplits. In other words, getSplits() computes the InputSplits from the input files, and createRecordReader() supplies the RecordReader implementation mentioned earlier, which reads key/value pairs from an InputSplit correctly. For example, LineRecordReader uses the byte offset as the key and each line of text as the value, so an InputFormat whose createRecordReader() returns a LineRecordReader reads every record from the split as an (offset, line) pair.
How the split size is calculated:

  • minSize: the minimum split size, configured via mapred.min.split.size (default 1 byte).
  • goalSize: totalSize divided by the requested number of map tasks (the default number of tasks is 1).
  • blockSize: the HDFS block size, 128 MB by default.
  • The split size is: Math.max(minSize, Math.min(goalSize, blockSize)) (see the small calculation sketch after this list).
  • If a file is smaller than 128 MB it is not split; no matter how small the file is, it becomes its own split and is handed to one map task. A large number of small files therefore produces a large number of map tasks, which greatly reduces cluster performance.
  • There are as many map tasks as there are splits: one split, one map task.
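
As a quick sanity check of the formula above, here is a small self-contained sketch. The input size, task count, and block size are illustrative assumptions only.

// Sketch of the split-size calculation from the list above.
public class SplitSizeDemo {
    public static void main(String[] args) {
        long minSize   = 1L;                      // mapred.min.split.size
        long blockSize = 128L * 1024 * 1024;      // HDFS block size: 128 MB
        long totalSize = 1024L * 1024 * 1024;     // a hypothetical 1 GB input file
        long numTasks  = 1L;                      // requested number of map tasks
        long goalSize  = totalSize / numTasks;    // 1 GB

        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        System.out.println(splitSize);            // 134217728 (128 MB), i.e. 8 splits
    }
}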

④Writing to the ring buffer

  • After the map function has applied its processing logic, its output is collected by the OutputCollector and written into a circular in-memory ring buffer.
  • The ring buffer is 100 MB by default. When it is 80% full, the buffered data is spilled to disk (see the configuration sketch after this list).
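
For reference, the buffer size and spill threshold can be tuned through the job configuration. The sketch below is only illustrative: the property names are the standard Hadoop 2.x settings, and the values shown are assumptions, not recommendations.

import org.apache.hadoop.conf.Configuration;

// Illustrative tuning of the map-side ring buffer.
public class SpillTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Size of the in-memory buffer for map output (default 100 MB).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer that triggers a spill to disk (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        System.out.println(conf.get("mapreduce.task.io.sort.mb"));
    }
}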

⑤⑥Spill (partition, sort)

  • When the data in the ring buffer reaches 80% of its capacity, it is spilled to disk. During this process the program partitions the data (HashPartitioner by default) and sorts it by key (quicksort by default).
  • The data that keeps spilling out of the buffer forms multiple small spill files (the arrow at step ⑥ in the figure points to these small files).
  • Partitioning:
  • When data spills from the ring buffer to a file, it is partitioned by the user-defined partition function; if the user does not provide one, the default HashPartitioner partitions by hashing the key. Hash partitioning is flexible: it is independent of the data type, simple to implement, and only requires setting the number of reduce tasks. The purpose of partitioning is to divide the whole data set into several blocks, each processed by its own reduce task and written to its own output file. A custom partition is usually needed when the output must be separated by some criterion. For example, in the traffic statistics case mentioned earlier, if the final output must be stored in separate files according to the province of the phone number, you need a custom partition function, must set the number of reduce tasks equal to the number of partitions in the driver (job.setNumReduceTasks(5);), and must register your own partitioner (job.setPartitionerClass(ProvincePartitioner.class)). A sketch of such a partitioner appears after this list. When you want a single unified output, there is no need to customize the partition or set the number of reduce tasks (the default is 1).
  • Custom partition functions can sometimes cause data skew: some partitions receive far more data than others, so the total job time is dominated by the reduce task with the most work. This should be avoided.
  • Sorting:
  • The MapReduce process sorts the data several times: when spilling from the ring buffer, when merging the spill files into a large file, and when merging the partition data from multiple map tasks on the reduce side. All of these sorts are based on the key's compareTo method (see the custom key sketch after this list).
  • The order in which the map side outputs data is not necessarily the order in which the reduce side receives it, because the data is sorted in between; the order that appears in the output file is the order in which the reduce function writes it. If there is no reduce logic, explicitly set the number of reduce tasks to 0 in the driver (0 means there is no reduce phase and no shuffle phase, so the data is neither sorted nor grouped); otherwise, even without reduce logic, the shuffle phase still runs, and the order of the data in the output file is not the order in which the map function wrote it, but the order after shuffle grouping and sorting.
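
Here is a hedged sketch of what the ProvincePartitioner mentioned above might look like. The key/value types and the prefix-to-partition mapping are assumptions for illustration; the original traffic-statistics code is not shown in this post.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route each record by the phone-number prefix in the key.
public class ProvincePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String prefix = key.toString().substring(0, 3);  // assumes keys are phone numbers
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;                        // everything else
        }
    }
}

In the driver you would pair this with job.setNumReduceTasks(5); and job.setPartitionerClass(ProvincePartitioner.class); so that partitions 0 through 4 each go to their own reduce task and their own output file.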
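
And because every sort in the shuffle uses the key's compareTo method, a custom key type controls the sort order. The FlowKey class below is a hypothetical example (not from the original post) that sorts keys by descending flow value.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: compareTo() decides the order used during spill and merge.
public class FlowKey implements WritableComparable<FlowKey> {
    private long flow;

    public FlowKey() { }
    public FlowKey(long flow) { this.flow = flow; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(flow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        flow = in.readLong();
    }

    @Override
    public int compareTo(FlowKey other) {
        // Descending order: larger flow values sort first.
        return Long.compare(other.flow, this.flow);
    }
}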

⑦Merge (with combiner added)

- I have added a combiner at this step.

  • The corresponding partitions of the multiple spill files are merged together (partition 0 of every spill file is merged into a single partition 0) to form one large file.
  • Merge sorting ensures that the data within each partition remains sorted.
  • After this merge, each map task has one large, sorted, partitioned output file; the grouping of equal keys into an iterator happens later, during the reduce-side merge.

- Combiner (a mini reduce on the map side)

  • Cluster bandwidth limits the number of MapReduce jobs that can run, so data transfer between map and reduce tasks should be minimized. Hadoop therefore allows users to process the map output locally: users can define a combiner function (written much like a map or reduce function), whose logic is usually the same as the reduce function. The combiner's input is the map output, and the combiner's output becomes the reduce input. In many cases the reduce class can be used directly as the combiner (job.setCombinerClass(FlowCountReducer.class);).
  • The combiner is an optimization, so you cannot rely on how many times it will be called: it may run when the ring buffer spills to a file, and again when the spill files are merged into a large file. You must ensure that no matter how many times the combiner is called, it does not affect the final result.
  • For that reason, not every kind of processing logic can use the combiner component; some logic would change the final reduce output if a combiner were applied. For example, when computing the average of several numbers, you cannot use a combiner to average each map task's output and then average those averages, because the result would be wrong (see the small demonstration after this list).
  • The point of the combiner is to pre-aggregate each map task's output locally and reduce the amount of data sent over the network. (Without a combiner the data passed to reduce might be (a, (1,1,1,1,1,1,...)); with a combiner it becomes (a, (4,2,3,5,...)).)
  • If there is no combiner, there is no step ⑦
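
The averaging pitfall mentioned above is easy to verify with plain Java, no Hadoop required. The numbers are arbitrary assumptions; the point is that partial sums recombine correctly while partial averages do not.

// Why a combiner must not change the final result: sums pre-aggregate safely, averages do not.
public class CombinerPitfall {
    public static void main(String[] args) {
        // Summation: combining partial sums and summing again gives the same total.
        int combinedSum = (1 + 2) + 3;                 // with a sum-style combiner
        int directSum   = 1 + 2 + 3;                   // without a combiner
        System.out.println(combinedSum == directSum);  // true

        // Averaging: averaging per-task averages gives the wrong answer.
        double avgOfAvgs = ((1 + 2) / 2.0 + 3) / 2.0;  // 2.25 (wrong)
        double trueAvg   = (1 + 2 + 3) / 3.0;          // 2.0  (correct)
        System.out.println(avgOfAvgs == trueAvg);      // false
    }
}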

⑧Group

  • The input data to reduce is divided into groups by key equality: all values whose keys are equal are passed to the reduce function together as one iterator object. For example, the reduce input might be: group 1: (a, (1,1,1,1,1,...)); group 2: (b, (1,1,1,1,1,...)). The reduce function is called once for each group, as the sketch below shows.
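
Here is the reduce() sketch promised earlier, again for the word-count example: the Iterable it receives is exactly the per-key group of values described above, and the method runs once per group.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reduce(): sums the grouped values for each key.
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {   // e.g. key "a" with values (1,1,1,1,...)
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);      // emits (a, 4), (b, 5), ...
    }
}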

⑨Merge

  • The node running a reduce task downloads its own partition's data from every map task (step ⑧ in the figure) into its local working directory. These partition files are merged and sorted into one large file and grouped by key (values with equal keys are grouped together as an iterator).

⑩Reduce task

  • The reduce task reads the grouped, sorted data from its local working directory and applies the processing logic of the reduce function to it.

⑪Output

  • Each reduce task outputs one result file (a complete driver sketch follows below).
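
Finally, a hedged sketch of a driver that wires the earlier word-count sketches together; class names and paths are assumptions. With two reduce tasks, the output directory would contain part-r-00000 and part-r-00001.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical word-count driver using the WcMapper/WcReducer sketches above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WcMapper.class);
        job.setCombinerClass(WcReducer.class);   // a sum reducer is safe as a combiner
        job.setReducerClass(WcReducer.class);
        job.setNumReduceTasks(2);                // two reduce tasks -> two output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Each reduce task writes one file: part-r-00000, part-r-00001, ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}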


Original source: blog.csdn.net/weixin_44695793/article/details/108166131