How Spark Shuffle works

1. What is Spark shuffle?

Shuffle literally means to shuffle, as in shuffling a deck of cards. The purpose of shuffle in Spark is to ensure that all values belonging to the same key end up in the same partition, where they can be aggregated and processed together.

The shuffle process essentially takes the data produced on the map side, divides it with a partitioner, and sends each piece to the corresponding reducer. Shuffle is the bridge between Map and Reduce: the output of the map phase must go through shuffle before it can be consumed by the reduce phase, so shuffle performance directly affects the performance and throughput of the entire program. In a distributed setting, reduce tasks have to pull the results of map tasks from other nodes across the network, which consumes network bandwidth, memory, and disk IO. Shuffle is usually split into two parts: data preparation in the map phase and data copying and processing in the reduce phase. Shuffle on the map side is generally called shuffle write, and shuffle on the reduce side is called shuffle read.
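As a concrete, minimal sketch (the object name, local master, and sample data below are invented purely for illustration), here is a word count whose reduceByKey step introduces exactly this shuffle boundary: the map side writes partitioned output, and the reduce side pulls it.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-demo").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // map runs entirely within each partition; reduceByKey forces a shuffle so
    // that all values of the same key end up in the same partition.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // shuffle boundary

    counts.collect().foreach(println)    // e.g. (a,3), (b,2), (c,1)
    println(counts.toDebugString)        // the lineage shows a ShuffledRDD, i.e. a new stage
    spark.stop()
  }
}
```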

[Figure: overview of shuffle write on the map side and shuffle read on the reduce side]

A brief description of Shuffle on the map side:

  1. Input: the map task runs on its split of the input data;
  2. Partition: each map task has a memory buffer that holds the map output;
  3. Spill: when the buffer is almost full, its contents are written to disk as a temporary file (a simplified sketch of the buffer/spill/merge flow follows this list);
  4. Merge: when the whole map task finishes, all the temporary files it produced on disk are merged into one final output file, which then waits for the reduce tasks to pull the data.
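To make the buffer → spill → merge flow concrete, here is a deliberately simplified sketch. It is not Spark's actual shuffle writer: the class name, the spill threshold, and the text-file format are all invented, and real writers handle serialized bytes rather than strings.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Simplified illustration of the map-side flow: buffer -> spill -> merge.
class MapSideWriter(spillThreshold: Int = 4) {
  private val buffer = new ArrayBuffer[(String, Int)]()
  private val spills = new ArrayBuffer[File]()

  def write(record: (String, Int)): Unit = {
    buffer += record
    if (buffer.size >= spillThreshold) spill()   // buffer "almost full" -> spill to disk
  }

  private def spill(): Unit = {
    val file = File.createTempFile("spill-", ".tmp")
    val out = new PrintWriter(file)
    buffer.foreach { case (k, v) => out.println(s"$k,$v") }
    out.close()
    spills += file
    buffer.clear()
  }

  // When the map task finishes, merge all temporary spill files into one output file.
  def close(output: File): Unit = {
    if (buffer.nonEmpty) spill()
    val out = new PrintWriter(output)
    spills.foreach { f =>
      Source.fromFile(f).getLines().foreach(line => out.println(line))
      f.delete()
    }
    out.close()
  }
}
```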

Brief description of Shuffle on the reduce side:

Before a reduce task executes, it continuously pulls the final output of every map task in the current job, merges the data pulled from different places as it arrives, and finally forms one file that serves as the input file of the reduce task.

  1. Copy phase: pull the data.
  2. Merge phase: merge the small files that were pulled (a minimal merge sketch follows this list).
  3. Reducer computation.
  4. Output of the calculation result.
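The merge phase can be pictured as a k-way merge of streams that are already sorted by key. The sketch below only illustrates the idea; real shuffle implementations merge serialized blocks, not in-memory tuples, and the key/value types here are invented.

```scala
import scala.collection.mutable

// Merge several key-sorted streams into one key-sorted stream.
def mergeSorted(streams: Seq[Iterator[(String, Int)]]): Iterator[(String, Int)] = {
  val its = streams.toIndexedSeq
  // PriorityQueue is a max-heap, so reverse the key ordering to pop the smallest key first.
  val heads = mutable.PriorityQueue.empty[(String, Int, Int)](
    Ordering.by[(String, Int, Int), String](_._1).reverse)

  // Seed the heap with the first record of every stream, remembering the stream index.
  its.zipWithIndex.foreach { case (it, i) =>
    if (it.hasNext) { val (k, v) = it.next(); heads.enqueue((k, v, i)) }
  }

  new Iterator[(String, Int)] {
    def hasNext: Boolean = heads.nonEmpty
    def next(): (String, Int) = {
      val (k, v, i) = heads.dequeue()
      if (its(i).hasNext) { val (nk, nv) = its(i).next(); heads.enqueue((nk, nv, i)) }
      (k, v)
    }
  }
}
```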

We can present the Shuffle process as a data stream:

[Figure: MapReduce shuffle data flow with 4 map tasks and 3 reduce tasks]

The figure describes the entire MapReduce data flow:

There are 4 map tasks on the map side and 3 reduce tasks on the reduce side. The 4 maps run in 4 JVMs, each processing one input split (split1~split4). Each map produces a single output file, but that file contains 3 parts of data, one for each downstream reduce (marked red 1, green 2, and blue 3). As described earlier, after the mapper runs, the Partitioner interface decides, based on the key (or value) and the number of reducers, which reduce task each output record of the current map should ultimately be processed by. The 3 reducers on the reduce side then each fetch their own part of the data from the output of all 4 maps.
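The routing decision can be sketched as a hash-style partitioner, in the spirit of Spark's HashPartitioner: partition = hash(key) mod numPartitions, adjusted to stay non-negative. The snippet below is REPL-style and the keys are invented for illustration.

```scala
// Which of the 3 reduce tasks does each key go to?
def partitionFor(key: Any, numReducers: Int): Int = {
  val raw = key.hashCode % numReducers
  if (raw < 0) raw + numReducers else raw   // keep the result in [0, numReducers)
}

val numReducers = 3
Seq("apple", "banana", "cherry", "date").foreach { k =>
  println(s"$k -> reduce task ${partitionFor(k, numReducers)}")
}
```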

2. In Spark, under what circumstances will shuffle occur?

1. Deduplication: distinct

2. Aggregation: reduceByKey, groupBy, groupByKey, aggregateByKey, combineByKey

3. Sort: sortByKey, sortBy

4. Repartition: coalesce, repartition

5. Set or table operations: intersection, subtract, subtractByKey, join, leftOuterJoin (an example follows this list)
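A spark-shell style illustration (it assumes an existing SparkContext named sc, and the data is made up): each transformation below introduces a shuffle dependency, which toDebugString makes visible.

```scala
val nums  = sc.parallelize(1 to 100, 4)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val other = sc.parallelize(Seq(("a", "x"), ("c", "y")), 4)

nums.distinct()             // deduplication
pairs.reduceByKey(_ + _)    // aggregation
pairs.sortByKey()           // sorting
nums.repartition(8)         // repartitioning (coalesce with shuffle = true)
pairs.join(other)           // table/set style operation

// The lineage of any of these shows the shuffle dependency:
println(pairs.reduceByKey(_ + _).toDebugString)
```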

3. Shuffle operating principle

In Spark's source code, the component responsible for executing and managing the shuffle process is the ShuffleManager. Before Spark 1.2, the default shuffle engine was HashShuffleManager.

3.1 HashShuffleManager in early Spark

Before Spark 1.2, the default shuffle engine was HashShuffleManager. Its computation model is simple and brute-force, and works as follows:

  • shuffle write stage

In this stage, the data processed by each task in the stage is "partitioned" according to the downstream operator. For example, for reduceByKey, a hash function is applied to each key so that records with the same key are written to the same disk file, and each disk file belongs to exactly one task of the downstream stage. Before being written to disk, the data is first placed in a memory buffer; when the buffer is full, it is spilled to the disk file.

  • shuffle read stage

Each task of this stage needs to pull, over the network, all of the records with its keys from the output of the previous stage to the node where it runs, and then perform key-based aggregation or join operations. Because each shuffle write task created one disk file for every task of the downstream stage, a shuffle read task only needs to pull its own disk file from every node where the upstream stage's tasks ran.

[Figure: shuffle write and shuffle read under the unoptimized HashShuffleManager]

So this simple, brute-force HashShuffleManager has a very serious drawback: it generates a huge number of intermediate disk files, and the resulting disk IO greatly hurts performance. The number of disk files is determined by the number of tasks in the next stage: each task of the current stage must create one disk file for every task of the next stage. For example, if the next stage has 100 tasks, every task of the current stage creates 100 disk files; if the current stage has 50 tasks, a total of 5000 disk files are created.
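Reproducing the arithmetic from the paragraph above as a tiny snippet (the task counts are simply the numbers used in the example):

```scala
// Unoptimized HashShuffleManager: every map task writes one file per reduce task.
val currentStageTasks = 50
val nextStageTasks    = 100
val shuffleFiles      = currentStageTasks * nextStageTasks
println(s"$shuffleFiles intermediate disk files")   // 5000
```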

3.2 Optimized HashShuffleManager

Because of the drawbacks of the original HashShuffleManager, an optimized version was introduced later. The optimization is enabled through the parameter spark.shuffle.consolidateFiles=true, whose default value is false. Generally speaking, if you have to use HashShuffleManager, it is recommended to enable this option.
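A minimal configuration sketch, assuming a Spark 1.x release where HashShuffleManager is still available (both properties were removed along with the hash shuffle in later versions):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")            // explicitly pick the hash shuffle (Spark 1.x)
  .set("spark.shuffle.consolidateFiles", "true")   // enable the consolidation mechanism
```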

After the consolidation mechanism is turned on, a task no longer creates one disk file for each task of the downstream stage during shuffle write. Instead, the concept of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files, and the number of disk files equals the number of tasks in the downstream stage. Tasks are executed in parallel according to the number of CPU cores on each Executor, and each task in the first batch of parallel tasks creates a shuffleFileGroup and writes its data to the corresponding disk files.

When an Executor finishes one batch of tasks and starts the next batch, the new tasks reuse the existing shuffleFileGroups and append their data to the existing disk files instead of creating new ones. In other words, tasks running on the same Executor reuse the disk files created earlier. This effectively merges the disk files of multiple tasks, greatly reducing the number of disk files and thereby improving shuffle write performance.
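For comparison, a rough sketch of the file-count arithmetic with consolidation enabled; the executor and core numbers below are invented purely for illustration:

```scala
// With consolidation, files are created per concurrent task slot (Executor core),
// not per map task, so successive waves of tasks on the same core reuse the same files.
val mapTasks         = 50
val reduceTasks      = 100
val executors        = 10
val coresPerExecutor = 1      // 10 map tasks run in parallel, so the 50 tasks run in 5 waves

val filesWithoutConsolidation = mapTasks * reduceTasks                     // 5000
val filesWithConsolidation    = executors * coresPerExecutor * reduceTasks // 1000
println(s"$filesWithoutConsolidation vs $filesWithConsolidation")
```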

[Figure: HashShuffleManager with the consolidation mechanism enabled]

3.3 The current default SortShuffleManager

In versions after Spark 1.2, the default ShuffleManager was changed to SortShuffleManager. SortShuffleManager has two operating modes: the normal mode and the bypass mode. When the number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default), the bypass mode is used.
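The threshold itself is just a configuration property; a minimal sketch follows (the value 400 is only an example, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Shuffles whose reduce-side partition count is <= this value (and that need no
// map-side aggregation) take the bypass path instead of the sort path.
val conf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")
```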

  • Normal operation mode

In normal mode, data is first written into an in-memory data structure. Depending on the shuffle operator, different data structures are chosen: aggregation-style operators such as reduceByKey use a map (aggregating while writing into memory), while operators such as join use an array (records are written into memory directly). When memory usage reaches a threshold, the data is prepared to be spilled to disk.
  Before the spill, the data already in the memory structure is sorted by key. After sorting, it is written to the disk file in batches, 10,000 records per batch by default.
  A task may spill several times, producing multiple temporary files, which are finally merged into one large file. In the end only two files remain: the merged data file and an index file that records the start and end offsets of each downstream task's data within the data file. The downstream task reads its portion of the data file according to the index file. Note that these two files are produced by one upstream task; it is not the case that all tasks together end up with only two files.
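A small sketch of the idea behind the index file: it records cumulative byte offsets, so a downstream task can seek straight to its own segment of the single data file. The partition lengths below are invented for illustration.

```scala
val partitionLengths = Array(120L, 0L, 340L, 75L)      // bytes written for each reduce partition
val offsets = partitionLengths.scanLeft(0L)(_ + _)     // Array(0, 120, 120, 460, 535)

def segmentFor(reducePartition: Int): (Long, Long) =
  (offsets(reducePartition), offsets(reducePartition + 1))   // (start offset, end offset)

println(segmentFor(2))   // (120, 460): partition 2's data lives in bytes [120, 460)
```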

[Figure: SortShuffleManager, normal operation mode]

  • bypass operation mode

Conditions that trigger the bypass mechanism:

  1. The number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).
  2. The shuffle operator is not an aggregation operator that requires map-side combine (groupByKey, for example, qualifies).

We know that comparison-based sorting cannot do better than O(n log n), so skipping the sort saves a significant amount of work and noticeably improves shuffle performance. The difference between the bypass mode and the normal SortShuffleManager mode is that bypass replaces the sorting overhead with O(1) hash routing per record, which improves the performance of this part.

Each task creates a temporary disk file for every downstream task, hashes each record's key, and writes the record to the disk file corresponding to that hash value. As before, a record is first written to a memory buffer, and when the buffer fills up it is spilled to the disk file. Finally, all the temporary disk files are merged into one data file, and a separate index file is created.
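A deliberately simplified, REPL-style sketch of the bypass write path. In-memory buffers stand in for the per-partition temporary files, and the "index" is just the per-partition record counts; the real writer of course deals with files and serialized bytes.

```scala
import scala.collection.mutable.ArrayBuffer

// Route each record to its partition by hashing the key (no sorting), then
// concatenate the partitions in order, as the final merge step does.
def bypassWrite(records: Seq[(String, Int)], numPartitions: Int): (Seq[(String, Int)], Array[Long]) = {
  val buckets = Array.fill(numPartitions)(new ArrayBuffer[(String, Int)]())
  records.foreach { case rec @ (k, _) =>
    val h = k.hashCode % numPartitions
    val p = if (h < 0) h + numPartitions else h   // O(1) hash routing per record
    buckets(p) += rec
  }
  val merged  = buckets.flatten.toSeq        // one "data file": partitions laid out back to back
  val lengths = buckets.map(_.size.toLong)   // what the index file would describe
  (merged, lengths)
}

val (data, lengths) = bypassWrite(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), numPartitions = 3)
println(data)                     // grouped by target partition, unsorted within each partition
println(lengths.mkString(","))
```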

[Figure: SortShuffleManager, bypass operation mode]

Summary of Spark's shuffle mechanisms: before Spark 1.2, the default shuffle engine was HashShuffleManager, which generates a large number of small disk files and therefore performs poorly. In Spark 1.2 and later, the default ShuffleManager was changed to SortShuffleManager. SortShuffleManager is an improvement over HashShuffleManager mainly because, although each task still produces several temporary disk files during the shuffle, they are all merged at the end, so each task is left with only one data file plus one index file. When a shuffle read task of the next stage pulls its data, it only needs to read its part of each data file according to the index.

For more on the principles and optimization of Spark shuffle, refer to the links below:

https://www.cnblogs.com/arachis/p/Spark_Shuffle.html

https://www.dazhuanlan.com/2019/12/07/5dead2acabaef/

Origin: blog.csdn.net/lp284558195/article/details/108445831