Getting into Big Data: MapReduce Knowledge Points (2)


01 Let's learn big data together

Today Lao Liu shares the second part of the MapReduce knowledge points. The first part explained the MapReduce workflow fairly thoroughly; this part pulls together the scattered knowledge points of MapReduce. The outline for this installment is as follows:
[Outline: Point 6 custom partitioner · Point 7 custom Combiner · Point 8 MR compression · Point 9 custom InputFormat]

02 Knowledge points to remember

Point 6: Custom partition

Point 5 in the previous article already mentioned that partitioning is done by a partitioner, that the default partitioner is HashPartitioner, and gave the relevant code. Now let's look at partitioning in detail.

Partition principle

MapReduce ships with a default partitioner, HashPartitioner. Its key method is getPartition(), which returns the partition index of the current key-value pair.

The detailed process is: ① Before the circular (ring) buffer spills to disk, each key-value pair is passed as an argument to getPartition();

② The hash value of the key is computed, bitwise ANDed with Integer.MAX_VALUE, and then taken modulo the number of reduce tasks. Suppose the number of reduce tasks is 4; then each spill file of the map task has 4 partitions, whose indices are 0, 1, 2, 3, and the output of getPartition() is correspondingly 0, 1, 2, or 3.

③ The result of this calculation determines which partition the current key-value pair falls into; if the result is 0, it falls into partition 0 of the spill file.

④ Finally, each partition is fetched over HTTP by the corresponding reduce task.
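For reference, the core of the default HashPartitioner is roughly the following (a sketch of the Hadoop class org.apache.hadoop.mapreduce.lib.partition.HashPartitioner; check your version for the exact source):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the hash is non-negative, then take it modulo the number of reduce tasks
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}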

Next, let's talk about custom partitioners. Why would you need a custom partitioner?

Because MR uses the default HashPartitioner, and when your business logic doesn't fit HashPartitioner's behavior, you need to design your own partitioner.

Here is an example: write a custom partitioner so that the key-value pairs in the file whose keys are Dear, Bear, River, and Car fall into the partitions with indices 0, 1, 2, and 3 respectively.

Let's analyze the logic first. Since it is a custom partitioner, you need to write your own partitioner class that extends Partitioner, implement the partitioning logic in getPartition(), and finally register the class and set the number of reduce tasks to 4 in main(). That's roughly it.

The key code is shared below:

import java.util.HashMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {

    // Define the partition index for each key, using a map data structure
    public static HashMap<String, Integer> dict = new HashMap<String, Integer>();

    static {
        dict.put("Dear", 0);
        dict.put("Bear", 1);
        dict.put("River", 2);
        dict.put("Car", 3);
    }

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numReduceTasks) {
        // Dear, Bear, River, and Car fall into the partitions with index 0, 1, 2, 3 respectively
        int partitionIndex = dict.get(text.toString());
        return partitionIndex;
    }
}
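To make the class take effect, main() also needs to register it and set the number of reduce tasks. A minimal sketch, assuming a Job object named job as in the usual word-count driver:

// In main(): register the custom partitioner and match the number of reduce tasks to the number of partitions
job.setPartitionerClass(CustomPartitioner.class);
job.setNumReduceTasks(4);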

That wraps up the custom partitioner; you can go ahead and summarize the steps of writing one yourself.

Point 7: Custom Combiner
[Figure: the shuffle process, with the combine step marked in red]
Look carefully at the red mark in the figure: that is where the combine operation happens. It turns two (Dear, 1) pairs into one (Dear, 2).

Why do we need the combine operation?

Suppose the map output contains 100 million (Dear, 1) pairs. Under the original scheme, the map side has to store all 100 million (Dear, 1) pairs, the reduce side fetches those 100 million pairs over the network, and only then does the reduce side aggregate them. The local disk IO on the map side and the network IO for shipping data from the map side to the reduce side are both large, and the network overhead in particular is far too high.

So can we find a way, before reduce1 pulls the 100 million (Dear, 1) pairs from map1, to do a reduce-style aggregation in advance on the map side, get the single result (Dear, 100000000), and transfer just that one key-value pair to reduce1? Of course we can: that is exactly the combine operation.

The specific process of the combine operation is as follows:

When the circular buffer of each map task fills up to 80%, it starts spilling to a disk file.

During the spill, the data is partitioned and each partition is sorted by key. If a combiner is set, the combine operation then runs; if map output compression is set, compression is applied as well.

When the spill files are merged, the combine operation is triggered again during the merge if there are at least 3 spill files and a map-side combiner is set;

However, if there are only 1 or 2 spill files, the combine operation is not triggered during the merge (the combine operation is essentially a reduce, so running it carries a certain overhead that isn't worth paying for so few files).
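The two thresholds above correspond to Hadoop configuration properties. The sketch below shows what should be the default values, but treat them as assumptions to verify against your Hadoop version:

// Spill when the circular buffer reaches 80% of its capacity
configuration.set("mapreduce.map.sort.spill.percent", "0.80");
// Run the combiner during the merge only when there are at least this many spill files
configuration.set("mapreduce.map.combine.minspills", "3");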

Combine is essentially a reduce: the custom combiner class extends the Reducer parent class.

The MR code is as follows:

// Set the combiner in main()
job.setCombinerClass(WordCountReduce.class);
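The WordCountReduce class reused above is not shown here; since a combiner is just a Reducer, it could look roughly like this (a minimal sketch of a summing word-count reducer):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A summing word-count reducer; because addition is associative, it can also serve as the combiner
public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}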

Point 8: MR compression

Why is there MR compression?

In MR, to reduce disk IO and network IO, you can consider enabling compression on the map output and on the job (reduce) output.

So how do you enable compression? Just add the following settings to the Configuration object in the main method:

// Enable compression of the map output
configuration.set("mapreduce.map.output.compress", "true");
// Set the map output codec to BZip2Codec, which Hadoop supports out of the box and which is splittable
configuration.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.BZip2Codec");
// Enable compression of the job output
configuration.set("mapreduce.output.fileoutputformat.compress", "true");
// Specify the codec used for the job output
configuration.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.BZip2Codec");

That's about it for MR compression; go ahead and summarize it yourself.

Point 9: Custom InputFormat

Here Lao Liu mainly talks about the InputFormat process. As mentioned in the first MapReduce article: assuming the MR input file consists of three blocks, block1, block2, and block3, each block corresponds to one split, and each split corresponds to one map task.

However, that article did not explain how the file is actually split; it simply stated the conclusion. This section fills in that content.

First of all, the input files of a MapReduce job are generally stored on HDFS. What we mainly want to see is how a map task reads its split of the data from HDFS.
Three key classes are involved here:

① InputFormat, the input format class.

② InputSplit, the input split class: InputFormat's getSplits() divides the input file into individual InputSplit splits; each map task corresponds to one split.

③ RecordReader, the record reader class: InputFormat's createRecordReader() returns a RecordReader, which reads the split's data and turns each record (one line) into a key-value pair that is passed into the map task's map() method each time map() is called.
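How the three classes fit together is easiest to see from an abbreviated sketch of the InputFormat base class (modeled on the new-API org.apache.hadoop.mapreduce.InputFormat; the real class carries more documentation):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

    // Cut the job's input into logical splits; each split is handed to one map task
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Create the RecordReader that turns a split into the key-value pairs fed to map()
    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}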

Now, why would you need a custom InputFormat?

Whether in HDFS or in MapReduce, processing small files hurts efficiency. In practice, though, scenarios with huge numbers of small files are unavoidable, so you need some way to handle them.

Small-file optimization basically comes down to the following approaches:

① At data collection time, merge small files or small batches of data into large files before uploading to HDFS (the SequenceFile approach).

② Before business processing, run a MapReduce program to merge the small files already on HDFS; this can be implemented with a custom InputFormat.

③ During MapReduce processing, use CombineFileInputFormat to improve efficiency; see the sketch right after this list.
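For approach ③, a minimal usage sketch, assuming the concrete CombineTextInputFormat subclass and a Job object named job (the split size is an illustrative value, not a recommendation):

// In main(): pack many small files into fewer, larger splits
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);   // cap each combined split at about 4 MB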

We can take the second approach and write a custom input format. Lao Liu can only touch on it this time; once Lao Liu has dug deeper into the source code, he will walk you through it properly.
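To at least give a feel for approach ②, here is a rough skeleton of one common way to do it, not Lao Liu's version: a hypothetical WholeFileInputFormat that reads every small file as a single record, so that a follow-up job can pack many of them into one large file such as a SequenceFile.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical sketch: each small file becomes exactly one (NullWritable, BytesWritable) record
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split a small file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file backing a split into a single BytesWritable value
    public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // nothing to do; the stream is closed in nextKeyValue()
        }
    }
}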

03 Summary of knowledge points

Well, that's about it for today's MapReduce summary. There is still quite a lot of content, and the hard part is the custom InputFormat, which Lao Liu only touched on; when there is time in the future and Lao Liu has gone deeper into the source code, he will give you a proper explanation.

Finally, if you run into any problems, contact the official account "Lao Liu who works hard"; if not, come keep doing big data together with Lao Liu.

Origin blog.csdn.net/qq_36780184/article/details/109905857