What is the MapReduce Shuffle? Read this article and it will become clear.

Preface (references)
Tencent Distributed Data Warehouse (TDW) is built on the open-source projects Hadoop and Hive, with extensive optimization and modification to fit the company's data volume, computational complexity, and other specific circumstances. A single cluster currently reaches a maximum of 5,600 machines and runs more than a million jobs per day, making TDW the company's largest offline data processing platform. To meet users' increasingly diverse computing needs, TDW is also moving in the real-time direction, providing users with more efficient, stable, and richer services.

The TDW compute engine consists of two parts: MapReduce, oriented toward offline processing, and Spark, oriented toward real-time processing. Both contain an important process called Shuffle. This article analyzes the Shuffle process, compares it across the two compute engines, and then thinks about and explores directions for subsequent optimization, in the hope that through our continued efforts the TDW compute engines will run even better.

What does Shuffle mean? Shuffle normally means to mix things up or disturb their order. You may be more familiar with the Collections.shuffle(List) method in the Java API, which randomly permutes the order of the elements in the list passed to it. If you do not know what Shuffle is in MapReduce, take a look at the picture below first.
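For readers who have not used it, here is a minimal Java sketch of that library method; note that it has nothing to do with the MapReduce Shuffle beyond sharing the name:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShuffleDemo {
    public static void main(String[] args) {
        // Collections.shuffle randomly permutes the elements of the list in place.
        List<String> words = Arrays.asList("aaa", "bbb", "ccc", "ddd");
        Collections.shuffle(words);
        System.out.println(words); // e.g. [ccc, aaa, ddd, bbb]
    }
}
```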
[Figure: the official MapReduce Shuffle diagram]
This is the official description of the Shuffle process. For now, you only need to know the approximate scope of Shuffle: how the output of a map task is effectively transferred to the reduce side. It can also be understood this way: Shuffle describes the process data goes through from the output of a map task to the input of a reduce task.
In a cluster environment such as Hadoop, most map tasks and reduce tasks execute on different nodes. In many cases, a reduce task therefore has to pull the results of map tasks from other nodes across the network. If the cluster is running many jobs, the normal execution of tasks consumes a considerable amount of the cluster's internal network resources. This kind of network consumption is normal; we cannot eliminate it, but we can do our best to reduce unnecessary consumption. Within a node, compared with memory, disk IO also has a considerable impact on job completion time. From these basic requirements, our expectations for the Shuffle process are:

  • Pull data completely from the map side to the reduce side.
  • When pulling data across nodes, reduce unnecessary bandwidth consumption as much as possible.
  • Reduce the impact of disk IO on task execution.

OK, having read this far, you can stop and think: if you were designing this Shuffle process yourself, what would your design goals be? The main things I would want to optimize are reducing the amount of data that has to be pulled and using memory rather than disk as much as possible.

Take WordCount as an example, and assume the job has 8 map tasks and 3 reduce tasks. As can be seen from the figure, the Shuffle process spans both the map side and the reduce side, so I will explain it in two parts. First look at the situation on the map side, as shown below:
[Figure: the Shuffle process on the map side]
The figure above may represent the operation of a single map task. Compare it with the left half of the official diagram and you will find many inconsistencies. The official diagram does not clearly explain at which stage partition, sort, and combiner actually take effect. This figure lets us clearly understand the whole map-side process, from data entering the map to all of the map's output data being ready.

The entire process is divided into four steps. In simple terms: each map task has a memory buffer that stores the map's output; when the buffer is almost full, its data needs to be stored to disk as a temporary file; when the entire map task has finished, all the temporary files this map task produced on disk are merged to generate the final official output file, which then waits for reduce tasks to come and pull the data.

Of course, each step here may in turn contain multiple steps and details, which are explained one by one below:

When a map task executes, its input data comes from blocks on HDFS; of course, in MapReduce terms, a map task reads only a split. The correspondence between split and block may be many-to-one, but by default it is one-to-one. In the WordCount example, assume the map's input data is a string such as "aaa". After the mapper runs, we know its output is a key/value pair with key "aaa" and value 1, because the map side only adds 1 for now; the result set is merged later in the reduce task. We already know that this job has 3 reduce tasks, so which reduce should the current "aaa" be handed to? That decision needs to be made now.
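As a minimal sketch of the map side of this WordCount example (class and variable names here are illustrative, not taken from the original article), a mapper that emits ("aaa", 1) style pairs might look like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in its split, e.g. ("aaa", 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // Which reduce task receives this pair is decided by the Partitioner.
                context.write(word, ONE);
            }
        }
    }
}
```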
MapReduce provides the Partitioner interface, whose job is to decide, based on the key or value and the number of reduce tasks, which reduce task the current output data should ultimately be handed to. By default, the key is hashed and then taken modulo the number of reduce tasks. This default modulo approach is only meant to spread the load evenly across reduce tasks; if users have their own requirements for partitioning, they can customize a Partitioner and set it on the job.
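As a rough sketch of what the default hash-and-modulo behaviour looks like (Hadoop's built-in HashPartitioner works essentially this way; the class name below is illustrative), a custom Partitioner could be written as:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task a (key, value) pair is sent to.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key, mask off the sign bit so the result is non-negative,
        // then take it modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class); if the default distribution is acceptable, no custom class is needed at all.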

In our example, Partitioner returns 0 for "aaa", meaning this pair should be handled by the first reducer. Next, the data needs to be written into the memory buffer. The buffer's purpose is to collect map results in bulk and reduce the impact of disk IO. Both our key/value pair and the partition result are written into the buffer. Of course, before being written, the key and value are serialized into byte arrays.
The memory buffer has a limited size, 100MB by default. When the map task produces a lot of output, the buffer could overflow, so under certain conditions the data in the buffer must be temporarily written to disk and the buffer then reused. This process of writing data from memory to disk is called spill, which can be translated as overflow write; the literal meaning is quite intuitive. The spill is performed by a separate thread and does not affect the thread that writes map results into the buffer. The spill thread must not block the map's output when it starts, so there is a spill threshold ratio for the buffer, spill.percent. The default ratio is 0.8; that is, when the buffered data reaches the threshold (buffer size * spill percent = 100MB * 0.8 = 80MB), the spill thread starts, locks this 80MB of memory, and carries out the spill, while the map task keeps writing its output into the remaining 20MB, with neither affecting the other. When the spill thread starts, it needs to sort the keys in this 80MB of space (Sort). Sorting is the default behavior of the MapReduce model, and the sort here is performed on the serialized bytes.
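The buffer size and spill threshold are configurable. A minimal sketch, assuming the Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent, and defaults may vary between versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static Job newTunedJob() throws Exception {
        Configuration conf = new Configuration();
        // Size of the in-memory map output buffer, in MB (100 MB by default).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer at which the spill thread starts (0.80 by default).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        return Job.getInstance(conf, "word count");
    }
}
```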

Here we can think about the following: since the map task's output needs to be sent to different reduce tasks, and the memory buffer does not merge data destined for the same reduce task, this merging should be reflected in the disk file. From the official diagram you can also see that the spill files written to disk are merged by the reduce task they are destined for. So an important detail of the spill process is that if many key/value pairs need to be sent to the same reduce task, these key/value pairs are concatenated together, reducing the number of index records associated with each partition.

While merging the data for each reduce task, some of the data may look like this: "aaa"/1, "aaa"/1. WordCount simply counts the number of occurrences of each word, so if the same key, such as "aaa", appears many times, their values should be merged together. This merging process within a single map task is called reduce, and also called combine. In MapReduce terminology, however, reduce refers only to the reduce side fetching data from multiple map tasks and performing the computation there; apart from that, informally merging data can only count as combine. In fact, as we know, MapReduce treats the Combiner as equivalent to the Reducer.

If the client has set a Combiner, now is the time to use it. Adding up the values of key/value pairs with the same key reduces the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it is used multiple times throughout the model. In which scenarios can a Combiner be used? From this analysis, the Combiner's output is the Reducer's input, and the Combiner must not change the final result of the computation. So a Combiner should only be used in scenarios where the reduce's input key/value types are exactly the same as its output key/value types and where it does not affect the final result, such as accumulation or taking a maximum. The Combiner must be used with caution: used well, it helps job efficiency; used badly, it affects the reduce's final result.
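For WordCount, the Combiner can simply be the same summing Reducer. A minimal sketch (the class name is illustrative; Hadoop's bundled WordCount example ships an almost identical reducer):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s for each word. Safe to use as a Combiner because addition is
// associative and the input and output key/value types are identical.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

It is enabled with job.setCombinerClass(IntSumReducer.class), usually alongside job.setReducerClass(IntSumReducer.class).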

Each spill generates a spill file on disk. If the map's output is really large and many such spills occur, there will be multiple corresponding spill files on disk. When the map task actually finishes, all the data remaining in the memory buffer is also spilled to disk to form a spill file, so in the end there is at least one spill file on disk (if the map's output is small, only one spill file exists by the time the map finishes). Because ultimately there must be only one file, these spill files need to be merged together; this process is called merge.
What does merge do? As in the previous example, "aaa" may have the value 5 when read from one spill file and 8 when read from another. Because they have the same key, they must be merged into a group. What is a group? For "aaa" it looks like this: {"aaa", [5, 8, 2, ...]}, where the values in the array were read from different spill files and are then added together. Note that since merge combines multiple spill files into one, the same key may appear more than once; if the client has set a Combiner, the Combiner is also used during the merge to combine identical keys.
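The following is not Hadoop's actual merge code, only an in-memory illustration of the grouping just described: values for the same key are collected from several spill files into one group and, when a Combiner is configured, reduced (summed here) during the merge:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {
    public static void main(String[] args) {
        // Pretend each map holds the (word, count) pairs of one spill file.
        List<Map<String, Integer>> spills = Arrays.asList(
                Map.of("aaa", 5, "bbb", 2),
                Map.of("aaa", 8, "ccc", 1),
                Map.of("aaa", 2));

        // Group values by key: {"aaa", [5, 8, 2]}, {"bbb", [2]}, {"ccc", [1]}.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map<String, Integer> spill : spills) {
            spill.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }

        // With a Combiner, each group is reduced as it is merged.
        grouped.forEach((k, vs) -> System.out.println(
                k + " -> " + vs + " => " + vs.stream().mapToInt(Integer::intValue).sum()));
    }
}
```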
At this point, all the work on the map side is finished, and the final file generated is stored in a local directory that the TaskTracker can reach. Each reduce task continuously obtains, via RPC from the JobTracker, information about whether map tasks have completed. When a reduce task is notified that the map task on some TaskTracker has finished, the second half of the Shuffle process starts.

Simply put, the work of a reduce task before it runs is to continuously pull the final results of every map task in the current job, then continuously merge the data pulled from different places, and eventually form a single file to serve as the reduce task's input file. As shown below:
[Figure: the Shuffle process on the reduce side]
Like the detailed map-side diagram, the Shuffle process on the reduce side can also be summarized by the three points marked in the figure above. The premise for the reduce task to copy data is that it keeps finding out from the JobTracker which map tasks have finished. Before the Reducer actually runs, all of its time is spent pulling data and merging, over and over again. As before, I describe the Shuffle on the reduce side in detail in segments below:

The copy phase, which simply pulls data. The reduce process starts some data copy threads (Fetchers) that request, over HTTP, the map tasks' output files from the TaskTrackers where the map tasks ran. Since the map tasks have already finished, these files are managed by the TaskTracker on local disk.
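The number of parallel fetcher threads is configurable. A small sketch, assuming the Hadoop 2.x property name (the documented default is 5; treat the exact name as version-dependent):

```java
import org.apache.hadoop.conf.Configuration;

public class CopyPhaseTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Number of parallel copy (fetcher) threads a reduce task uses
        // to pull map output over HTTP.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        return conf;
    }
}
```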
The merge stage. The merge here is like the map-side merge, except that the array now holds values copied from different map sides. The copied data is first put into a memory buffer, whose size is more flexible here than on the map side, because it is based on the JVM's heap size.
Since the Reducer does not run during the Shuffle phase, most of the memory should be given to the Shuffle. It should be emphasized that merge comes in three forms:

  • Memory to memory
  • Memory to disk
  • Disk-to-disk

The first form is not enabled by default. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if you have set a Combiner, it is enabled here too, and many spill files are generated on disk. This second form of merge keeps running until there is no more data coming from the map side; then the third form, the disk-to-disk merge, starts and generates the final file.
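The reduce-side buffer and merge thresholds are also configurable. A hedged sketch, assuming the Hadoop 2.x property names and their documented defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceMergeTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Fraction of the reduce task's heap used to buffer copied map output
        // during the Shuffle (documented default 0.70).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // When this fraction of that buffer fills up, the memory-to-disk merge
        // starts (documented default 0.66).
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // Fraction of the heap that may keep map output in memory while the
        // Reducer runs; the documented default 0.0 flushes everything to disk first.
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);
        return conf;
    }
}
```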

The Reducer's input file. After continuous merging, a "final file" is eventually generated. Why the quotes? Because this file may exist on disk, or it may exist in memory. For us, of course, we would like it to stay in memory as the Reducer's direct input, but by default this file is stored on disk. Once the Reducer's input file has been determined, the entire Shuffle is finally over. The Reducer then executes and puts its results on HDFS.
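To tie the earlier sketches together, a driver of the usual WordCount shape (WordCountMapper and IntSumReducer are the illustrative classes sketched above) submits the job and writes the Reducer's results to HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);  // sketched earlier
        job.setCombinerClass(IntSumReducer.class);  // optional, see the Combiner section
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```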


Source: blog.csdn.net/weixin_44598691/article/details/105013622