Setting a reasonable number of map and reduce tasks in the big data framework MapReduce

1 Overview

MapReduce is a highly abstracted big data job execution component. A job consists of two main phases, map and reduce. This article explains how the number of map and reduce tasks is determined in MapReduce and how to set these numbers reasonably.

2 Starting from the source code analysis

(1) Analyze the JobSubmitter task submission class

JobStatus submitJobInternal(Job job, Cluster cluster) {
      ......
      // compute the number of input splits
      int maps = writeSplits(job, submitJobDir);
      // set the split count as the number of map tasks
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      ......
}

Description: in this job submission method, the number of input splits is first computed by the writeSplits function, and that number is then set as the number of map tasks.

(2) Analyze FileInputFormat file-based input formatting class

public List<InputSplit> getSplits(JobContext job) {
   ......
   long splitSize = computeSplitSize(blockSize, minSize, maxSize);
   ......
}

Description: getSplits is the function that implements the actual slicing algorithm.

The computeSplitSize function calculates splitSize, the number of bytes covered by each split. The formula is as follows:

Math.max(minSize, Math.min(maxSize, blockSize))

The idea of the formula (assuming minSize does not exceed maxSize or blockSize) is:

(1) If maxSize is less than blockSize, each split covers maxSize bytes

(2) If maxSize is greater than or equal to blockSize, each split covers blockSize bytes

For example: the total input is a 28-byte file (the line terminators account for 2 of those bytes) with the following content:

hello how are
thank you tom

With maxSize = 15, minSize = 10, and blockSize at its default (far larger than maxSize), the split size works out to 15 bytes, so the 28 bytes are divided into 2 splits and 2 map tasks are ultimately generated.

blockSize: the HDFS block size, in bytes (the default in Hadoop 2.x is 128 MB)

minSize: mapreduce.input.fileinputformat.split.minsize (the minimum split size of a file, in bytes)

maxSize: mapreduce.input.fileinputformat.split.maxsize (the maximum split size of a file, in bytes)
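
To make the example concrete, here is a small standalone sketch (an illustration written for this article, not code copied from Hadoop) that reproduces the slicing arithmetic, including the 1.1 tolerance (SPLIT_SLOP) that FileInputFormat applies when deciding whether the leftover bytes get their own split; record boundaries and multi-file input are ignored:

public class SplitCountDemo {
    // Same tolerance used by FileInputFormat when cutting the last split
    private static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count how many splits a file of the given length produces
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Cut full-size splits while the remainder is more than 10% larger than splitSize
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the leftover bytes become the last, smaller split
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128L * 1024 * 1024, 10, 15); // 15 bytes
        System.out.println(countSplits(28, splitSize));                // prints 2
    }
}

Running it with the values from the example (blockSize = 128 MB, minSize = 10, maxSize = 15, file length 28) prints 2, matching the two map tasks described above.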

3 Customizing the number of map tasks

From the source code analysis above, we can conclude that the number of map tasks is mainly determined by three parameters: blockSize, minSize, and maxSize. We can therefore customize the number of map tasks by configuring these parameters, and there are several ways to do so.

3.1 Method 1: Configure the number of map tasks through the configuration file

Modify the mapred-site.xml file and add the following configuration

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>10</value>
</property>

<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>15</value>
</property>

3.2 Method 2: Set the number of map tasks in the program

FileInputFormat.setMinInputSplitSize(job, 10);
FileInputFormat.setMaxInputSplitSize(job, 15);
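
For context, a minimal driver sketch showing where these calls sit (the class name SplitSizeDemo and the argument-based paths are placeholders added for this illustration; the job otherwise uses the default identity mapper and reducer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        job.setJarByClass(SplitSizeDemo.class);

        // Force 10-15 byte splits, matching the example in section 2
        FileInputFormat.setMinInputSplitSize(job, 10);
        FileInputFormat.setMaxInputSplitSize(job, 15);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}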

4 Customizing the number of reduce tasks

The number of reduce tasks is set through the Job's setNumReduceTasks method, for example setNumReduceTasks(3).
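
In the driver this is a single call, for example (3 is simply the figure used above):

// Request 3 reduce tasks for this job
job.setNumReduceTasks(3);

Note that setting the value to 0 produces a map-only job: the reduce phase is skipped and the map output is written directly to the output path.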

5 Setting a reasonable number of map tasks

An appropriate number of map tasks can speed up job execution because map tasks run fully in parallel. So how many map tasks is it reasonable to set? Hadoop's official guidance is to keep roughly 10-100 map tasks per node, or somewhat more for very CPU-light map tasks, provided that each map task runs for at least about one minute. Starting a task takes a certain amount of time, so if a map task finishes faster than it takes to launch, setting too many map tasks will actually lower efficiency.
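
As a rough illustration of that guidance (all figures below are assumed for the example, not taken from the article):

// Assumed figures: 10 GB of input processed with the default 128 MB split size
long totalInput = 10L * 1024 * 1024 * 1024;                 // 10 GB of input
long splitSize  = 128L * 1024 * 1024;                        // 128 MB per split
long mapTasks   = (totalInput + splitSize - 1) / splitSize;  // 80 map tasks

// To get more map tasks, lower the maximum split size below the block size;
// to get fewer, raise the minimum split size above the block size.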

6 Setting a reasonable number of reduce tasks

Hadoop officially recommends setting the number of reduce tasks to 0.95 or 1.75 multiplied by (number of nodes * mapreduce.tasktracker.reduce.tasks.maximum).

(1) With the factor 0.95, all reduce tasks can be launched immediately and begin fetching map outputs as soon as the map tasks finish.

(2) With the factor 1.75, the faster nodes finish a first round of reduce tasks and then launch a second wave, which achieves better load balancing between map and reduce.

Increasing the number of reduce tasks increases framework overhead and can therefore affect overall job execution efficiency, but it also improves load balancing and lowers the cost of failures.
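
As a worked example of the recommendation (the cluster size and per-node slot count below are assumed values):

// Hypothetical cluster: 10 nodes, mapreduce.tasktracker.reduce.tasks.maximum = 2
int nodes = 10;
int maxReducesPerNode = 2;
int singleWave = (int) (0.95 * nodes * maxReducesPerNode); // 19 reduces, one wave
int twoWaves   = (int) (1.75 * nodes * maxReducesPerNode); // 35 reduces, two waves
job.setNumReduceTasks(singleWave);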

7 Summary

This article analyzed MapReduce's input-splitting process from the source code, described how the number of map and reduce tasks is set, and discussed how to configure these numbers correctly and reasonably. If you have any questions, please leave a comment!

 
