Hadoop Study Notes (10): How MapReduce Works (Key Points)

1. The complete running process of MapReduce

 

Walkthrough:

1. Start a job on the client.

2. Request a Job ID from the JobTracker.

3. The client copies the resources needed to run the job to HDFS, including the jar file containing the packaged MapReduce program, the configuration files, and the input split information computed by the client. These files are stored in a folder that the JobTracker creates specifically for this job; the folder is named after the job's Job ID. The jar file is replicated 10 times by default (controlled by the mapred.submit.replication property); the input split information tells the JobTracker how many map tasks should be started for this job.
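
Steps 1 through 3 all happen inside the client-side driver. As a minimal sketch (assuming a WordCount-style job on the Hadoop 1.x mapreduce API; WordCountMapper and WordCountReducer are placeholder classes sketched further below), the driver looks roughly like this:

```java
// A minimal sketch of the client-side job submission described in steps 1-3.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // step 1: start a job on the client
        job.setJarByClass(WordCountDriver.class);     // the job jar that gets copied to HDFS (step 3)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion obtains a Job ID, uploads the jar, configuration and
        // split information (steps 2-3), then polls the job's progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```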

4. After the JobTracker receives the job, it places it in a job queue to wait for the job scheduler (much like process scheduling in an operating system). When the job scheduler schedules the job according to its own scheduling algorithm, it creates one map task for each input split and assigns the map tasks to TaskTrackers for execution. For map and reduce tasks, each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on the host. It should be emphasized here that map tasks are not assigned to TaskTrackers at random; there is a concept called data locality (Data-Local): a map task is assigned to a TaskTracker that holds the data block the map will process, and the program jar is copied to that TaskTracker to run there. This is called "moving the computation instead of moving the data". Data locality is not considered when assigning reduce tasks.

5. The TaskTracker sends a heartbeat to the JobTracker periodically to tell the JobTracker that it is still alive. The heartbeat also carries a lot of information, such as the progress of the current map tasks. When the JobTracker receives the completion message for the job's last task, it marks the job as "successful". When the client queries the job status, it learns that the job is complete and displays a message to the user.

2. Shuffle and sorting process of MapReduce tasks

 Map-side process analysis

1. Each input split is processed by one map task. By default, the size of one HDFS block (64 MB by default) is used as a split, although the block size can also be configured. The map output is first written to a circular in-memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is close to overflowing (80% of the buffer size by default, controlled by the io.sort.spill.percent property), a spill file is created on the local file system and the data in the buffer is written to that file.
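
As a hedged example of the two properties just mentioned (these are the classic MRv1 names; later releases rename them to mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent), they can be tuned per job before submission:

```java
import org.apache.hadoop.conf.Configuration;

public class MapSortBufferTuning {
    // Sketch: tuning the map-side ring buffer for one job.
    static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                // ring buffer size in MB, default 100
        conf.setFloat("io.sort.spill.percent", 0.80f); // spill threshold as a fraction, default 0.8
        return conf;
    }
}
```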

2. Before writing to disk, the thread first divides the data into partitions equal in number to the number of reduce tasks, i.e. one reduce task corresponds to the data of one partition. This avoids the awkward situation where some reduce tasks are given a large amount of data while others get little or none. Partitioning is essentially a process of hashing the data. The data in each partition is then sorted, and if a Combiner has been set, the Combiner is run on the sorted output. The purpose of this is to write as little data to disk as possible.
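
For example, in the WordCount job used as the running example, the reducer simply sums counts, so the same class can double as the Combiner (a sketch; WordCountReducer is the placeholder name used in the driver above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A summing reducer that can also be used as the combiner, so each map writes
// per-word partial sums to disk instead of every individual (word, 1) pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();            // add up the partial counts for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}
// In the driver: job.setCombinerClass(WordCountReducer.class);
// Reusing the reducer is safe here only because summation is commutative and associative.
```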

3. By the time the map task writes its last record, there may be many spill files, and these files need to be merged. During merging, sorting and combiner operations are performed repeatedly, for two purposes: 1. to minimize the amount of data written to disk each time; 2. to minimize the amount of data transferred over the network in the following copy phase. The spills are finally merged into a single partitioned and sorted file. To further reduce the amount of data sent over the network, the map output can also be compressed here, simply by setting mapred.compress.map.output to true.

Data compression: Gzip, LZO, Snappy.
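
A hedged sketch of turning on map-output compression with Snappy (property names are the classic MRv1 ones; newer releases use mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    // Sketch: compress the intermediate map output before it is shuffled to the reducers.
    static void enable(Configuration conf) {
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      SnappyCodec.class, CompressionCodec.class);
    }
}
```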

4. The data in each partition is copied to the corresponding reduce task. Some people may ask: how does the data in a partition know which reduce it corresponds to? In fact, the map task keeps in touch with its parent TaskTracker, and the TaskTracker keeps a heartbeat with the JobTracker, so the JobTracker holds the macro-level information for the whole cluster. The reduce task only needs to obtain the locations of the corresponding map outputs from the JobTracker.

Shuffle analysis

Shuffle literally means "shuffling cards". Look at it this way: the data produced by a map is distributed, through hash partitioning, to different reduce tasks. Isn't that a process of shuffling the data?

 The concept of shuffle:

  Collections.shuffle(List list): randomly shuffles the order of the elements in the list.

  Shuffle in MapReduce: the process by which data travels from the output of the map tasks to the input of the reduce tasks.
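
For reference, the Java method mentioned above really does nothing more than randomize a list:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShuffleAnalogy {
    public static void main(String[] args) {
        List<String> cards = Arrays.asList("aaa", "bbb", "ccc", "ddd");
        Collections.shuffle(cards);   // random in-place permutation, like shuffling a deck
        System.out.println(cards);    // e.g. [ccc, aaa, ddd, bbb]
    }
}
```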

The process of Map-side shuffle:

1. Each map task has a memory buffer that stores the map's output. When the buffer is almost full, its contents are written to disk as a temporary file. After the whole map task finishes, all the temporary files it produced on disk are merged into the final output file, which then waits for the reduce tasks to pull the data.

2. When a map task runs, its input data comes from HDFS blocks. Conceptually, though, a map task only reads a split. The correspondence between splits and blocks may be many-to-one, and by default it is one-to-one. In the WordCount example, assume the input records handed to the map are all strings like "aaa".
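
If the default one-split-per-block behavior is not what you want, the split size can be bounded per job. A sketch using the new-API FileInputFormat helpers (the framework computes the split size as max(minSize, min(maxSize, blockSize))):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
    // Sketch: bounding the input split size so it no longer has to equal the block size.
    static void configure(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);   // lower bound: 32 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // upper bound: 128 MB
    }
}
```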

3. After the mapper runs, its output is a key/value pair such as key "aaa" with value 1. Since the map side here does nothing more than emit a 1 for each word, the actual aggregation of the result set happens in the reduce task. We said earlier that this job has 3 reduce tasks, so which reduce should the current "aaa" be sent to? That decision has to be made now.
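
A sketch of the WordCount-style mapper assumed in this example: for every word such as "aaa" it simply emits the pair ("aaa", 1) and leaves all aggregation to the reduce side.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("aaa", 1)
        }
    }
}
```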

4. MapReduce provides the Partitioner interface, which decides, based on the key or value and the number of reduce tasks, which reduce task should ultimately process the current output record. By default, the key's hash is taken modulo the number of reduce tasks. This default scheme only aims to spread the load evenly across the reducers; if the user has special requirements for partitioning, a custom Partitioner can be written and set on the job.
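
A sketch of what that default behavior boils down to, written out as an explicit Partitioner (the actual default is the library's HashPartitioner; the class name here is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key and take it modulo the number of reduce tasks, masking off
        // the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// A custom partitioner is registered with: job.setPartitionerClass(WordPartitioner.class);
```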

5. In the example, "aaa" gets partition 0 back from the Partitioner, meaning this pair should be handled by the first reducer. Next, the data needs to be written into the memory buffer. The buffer's job is to collect map results in batches and reduce the impact of disk I/O. Both the key/value pair and the partition result are written into the buffer. Of course, before being written, the key and value are serialized into byte arrays.

6. The memory buffer has a limited size, 100 MB by default. When the map task produces a lot of output, the buffer may fill up, so under certain conditions the data in the buffer must be temporarily written to disk and the buffer reused. This process of writing data from memory to disk is called a spill, which can be understood as an "overflow write". The spill is performed by a separate thread and does not interfere with the thread that writes map results into the buffer. So that the spill does not block the map's output, the buffer has a spill threshold ratio, spill.percent, 0.8 by default: when the data in the buffer reaches the threshold (buffer size * spill percent = 100 MB * 0.8 = 80 MB), the spill thread starts, locks those 80 MB of memory, and carries out the spill, while the map task keeps writing its output into the remaining 20 MB, the two not interfering with each other.

7. When the spill thread starts, it sorts the keys in the 80 MB of space. Sorting is the default behavior of the MapReduce model, and the sorting here is performed on the serialized bytes.

8. Because the map task's output has to be sent to different reducers, and the memory buffer does not merge data destined for the same reducer, that merging has to show up in the disk file. The official diagram also shows that the file written to disk contains merged sections for the different reduce ends. An important detail of the spill process is therefore that if many key/value pairs need to be sent to one reducer, those key/value pairs are concatenated into one contiguous block, reducing the number of partition-related index records.

The shuffle process on the Reduce side:

 

Reduce-side process analysis

1. The reduce task receives data from the different map tasks, and the data coming from each map is sorted. If the amount of data received by the reduce side is fairly small, it is kept in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which expresses it as a percentage of the heap space used for this purpose). If the amount of data exceeds a certain proportion of that buffer (determined by mapred.job.shuffle.merge.percent), the data is merged and then spilled to disk.
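
A hedged sketch of the two reduce-side properties just mentioned (classic MRv1 names; newer releases use the mapreduce.reduce.shuffle.* equivalents), set here to their usual default values:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleTuning {
    // Sketch: the reduce-side shuffle buffer knobs, shown at their usual defaults.
    static void configure(Configuration conf) {
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // share of reducer heap for fetched map output
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // fill level that triggers merge and spill to disk
    }
}
```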

2. As the spilled files accumulate, a background thread merges them into a single, larger sorted file, in order to save space for subsequent merges. In fact, whether on the map side or on the reduce side, MapReduce performs sort and merge operations over and over again. Now I finally understand why some people say: sorting is the soul of Hadoop.

3. Many intermediate files are written to disk during the merging process, but MapReduce tries to write as little data to disk as possible, and the result of the final merge is not written to disk at all; it is fed directly into the reduce function.
