Principle of MapReduce

1. How a MapReduce job runs

The following schematic diagram of the process was drawn in Visio 2010:

[Figure: MapReduce job execution flow among the client, JobTracker, and TaskTrackers]

Process analysis:


1. The client starts a job (a minimal driver that walks through these steps is sketched after this list).


2. The client requests a Job ID from the JobTracker.


3. The client copies the resources required to run the job to HDFS, including the JAR file packaged from the MapReduce program, the configuration file, and the input split information it has computed. These files are stored in a folder that the JobTracker creates specifically for the job, named after the Job ID. The JAR file is replicated 10 times by default (controlled by the mapred.submit.replication property), and the input split information tells the JobTracker how many map tasks to start for this job.


4. After the JobTracker receives the job, it places it in a job queue to wait for the job scheduler (much like process scheduling in an operating system). When the scheduler picks the job according to its scheduling algorithm, it creates one map task for each input split and assigns each map task to a TaskTracker for execution. Each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on its host. One point worth emphasizing: map tasks are not assigned to TaskTrackers at random. There is a concept of data locality here: a map task is assigned to a TaskTracker that holds the data block the map will process, and the program JAR is copied to that TaskTracker to run there, which is summed up as "move the computation, not the data". Data locality is not considered when assigning reduce tasks.


5. The TaskTracker sends a heartbeat to the JobTracker at regular intervals to tell the JobTracker that it is still alive. The heartbeat also carries other information, such as the progress of the current map task. When the JobTracker receives the completion message for the job's last task, it marks the job as "successful". The next time the JobClient queries the status, it learns that the job is complete and displays a message to the user.
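
To make the five steps above concrete, here is a minimal word-count driver written against the old org.apache.hadoop.mapred API of the Hadoop 1.x (JobTracker/TaskTracker) era described above. The class names are illustrative and not part of the original article; treat it as a sketch of what the client submits in steps 1-3 and what the JobTracker then schedules in steps 4-5.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Minimal word-count job for Hadoop 1.x; all names here are illustrative.
public class WordCountDriver {

    // Map: emit (word, 1) for every token in the input split this task is given.
    public static class TokenizeMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    // Reduce (also usable as a combiner): sum the counts for each word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenizeMapper.class);
        conf.setCombinerClass(SumReducer.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Steps 1-3: submitting copies the JAR (10 replicas by default), the
        // configuration and the input split information to HDFS, after asking
        // the JobTracker for a Job ID.
        // Steps 4-5: runJob() hands the job to the JobTracker and polls its
        // progress until the last task reports success.
        JobClient.runJob(conf);
    }
}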

The above analyzes how MapReduce works at the level of the client, the JobTracker, and the TaskTracker. Now let's go a level deeper and look at it from the perspective of the map task and the reduce task.

2. Shuffle and sort in the map and reduce tasks

Again, here is the schematic diagram of the process, drawn in Visio:

[Figure: shuffle and sort on the map side and the reduce side]

Process analysis:

Map side:

1. Each input split is processed by one map task. By default, the size of one HDFS block (64 MB by default) is used as the split size, although the block size can be configured. The map output is first written to a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer approaches its threshold (80% of the buffer size by default, controlled by the io.sort.spill.percent property), a spill file is created on the local file system and the data in the buffer is written to that file.

2. Before writing to disk, the thread first divides the data into partitions, one per reduce task, so that each reduce task receives the data of exactly one partition. This avoids the awkward situation where some reduce tasks are given a large amount of data while others get little or none. Partitioning is essentially a hashing of the keys (a partitioner sketch follows this list). The data within each partition is then sorted, and if a Combiner has been set, the combine operation runs over the sorted output. The purpose of all this is to write as little data as possible to disk.

3. By the time the map task writes its last record, there may be many spill files, and they need to be merged. During the merge, sorting and combining are performed repeatedly, for two reasons: 1. minimize the amount of data written to disk each time; 2. minimize the amount of data transferred over the network in the subsequent copy phase. The spill files are finally merged into a single partitioned and sorted file. To further reduce the amount of data sent over the network, the map output can be compressed by setting mapred.compress.map.output to true.

4. The data in each partition is copied to the corresponding reduce task. You may ask: how does the data in a partition know which reduce task it belongs to? The map task stays in contact with its parent TaskTracker, and the TaskTracker keeps a heartbeat with the JobTracker, so the JobTracker holds a global view of the entire cluster. A reduce task simply asks the JobTracker for the locations of the corresponding map outputs.
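
As a sketch of the hash partitioning and the map-side spill settings mentioned in steps 1-3, here is a partitioner that mirrors Hadoop's default HashPartitioner, together with the relevant Hadoop 1.x configuration properties. The class name WordHashPartitioner and the helper configureMapSide are illustrative and not from the original article.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hash partitioning as described in step 2: each map output record is placed
// into one of numPartitions buckets, one bucket per reduce task.
public class WordHashPartitioner implements Partitioner<Text, IntWritable> {

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) { }

    // Map-side spill and compression knobs from steps 1-3 (Hadoop 1.x names);
    // the values shown are the defaults mentioned in the text.
    public static void configureMapSide(JobConf conf) {
        conf.setInt("io.sort.mb", 100);                       // ring buffer size, MB
        conf.set("io.sort.spill.percent", "0.80");            // spill threshold
        conf.setBoolean("mapred.compress.map.output", true);  // compress map output
        conf.setPartitionerClass(WordHashPartitioner.class);
    }
}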

That concludes the map side. So what exactly is Shuffle? Shuffle literally means "to mix the cards". Looked at this way, the data produced by a map task is hashed into partitions and distributed to different reduce tasks, which really is a process of shuffling the data.

Reduce side:

1. A reduce task receives data from different map tasks, and the data from each map task is sorted. If the amount of data received on the reduce side is small enough, it is kept in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which is the fraction of the heap reserved for this purpose). If the amount of data exceeds a certain fraction of that buffer (determined by mapred.job.shuffle.merge.percent), it is merged and spilled to disk (a small tuning sketch follows this list).

2. As spill files accumulate, a background thread merges them into a larger, sorted file in order to save time in later merges. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging; now I finally understand why some people say sorting is the soul of Hadoop.

3. Many intermediate files are written to disk during the merging, but MapReduce tries to write as little data to disk as possible, and the result of the final merge is not written to disk at all; it is fed directly into the reduce function.
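
For completeness, here is a small sketch of the reduce-side shuffle properties mentioned above, again using the Hadoop 1.x names. The class ShuffleTuning and its helper method are illustrative; the values shown are the stock defaults.

import org.apache.hadoop.mapred.JobConf;

// Reduce-side shuffle knobs from steps 1-2; treat this as a sketch, not a recipe.
public class ShuffleTuning {

    public static void configureReduceSide(JobConf conf) {
        // Fraction of the reducer's heap used to buffer map outputs during the copy phase.
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
        // When the in-memory buffer reaches this fraction, merge and spill to disk.
        conf.set("mapred.job.shuffle.merge.percent", "0.66");
        // The number of reduce tasks equals the number of partitions on the map side.
        conf.setNumReduceTasks(2);
    }
}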
