Spark Learning Road (10): SparkCore Tuning - Shuffle Tuning

Excerpted from https://tech.meituan.com/spark-tuning-pro.html

1. Overview

The performance of most Spark jobs is mainly spent in the shuffle phase, because this phase involves a large number of disk IO, serialization, and network data transfer operations. Therefore, if job performance is to be improved, the shuffle process needs to be tuned. However, it must also be stressed that the main factors affecting the performance of a Spark job are code quality, resource parameters, and data skew; shuffle tuning accounts for only a small part of overall Spark performance tuning. You must therefore grasp the basic principles of tuning and not neglect the fundamentals. Below we explain the principle of shuffle in detail, describe the related parameters, and give tuning suggestions for each parameter.

2. Definition of Shuffle

The operation of Spark is mainly divided into two parts:

  One part is the driver, the core of which is the SparkContext;

  The other part is the Tasks on the Worker nodes, which run the actual work. While the program runs, the Driver and Executor processes interact with each other: which task to run, that is, the Driver assigns Tasks to Executors, which involves network transmission between the Driver and the Executors; and where a task's data comes from, that is, a Task learns from the Driver where the output of upstream Tasks is located and then pulls those results, so network transfers are continuously generated during this process. The process in which the next Stage asks the previous Stage for data is what we call Shuffle.
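
For a concrete sense of where a shuffle occurs, here is a small word-count example (generic Spark code, not taken from the article): reduceByKey forces the data to be repartitioned by key, so Spark splits the job into two stages with a shuffle between them.

    import org.apache.spark.{SparkConf, SparkContext}

    // Word count: the map-side stage ends at reduceByKey, and the next stage
    // begins by shuffling so that all counts for the same word meet on one task.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")            // "input.txt" is a placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // stage boundary: the shuffle happens here
    counts.collect().foreach(println)
    sc.stop()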

3. Overview of ShuffleManager Development

In the Spark source code, the component responsible for executing, computing and processing the shuffle process is the ShuffleManager, that is, the shuffle manager. As Spark versions have evolved, the ShuffleManager has kept iterating and becoming more advanced.

Before Spark 1.2, the default shuffle engine was HashShuffleManager. This ShuffleManager has a very serious drawback: it generates a large number of intermediate disk files, and the resulting disk IO operations degrade performance.

Therefore, in versions after Spark 1.2, the default ShuffleManager was changed to SortShuffleManager. Compared with HashShuffleManager, SortShuffleManager brings certain improvements. The main one is that although each Task still generates many temporary disk files during the shuffle operation, in the end all the temporary files are merged into a single disk file, so each Task produces only one disk file. When the shuffle read tasks of the next stage pull their data, they only need to read the relevant part of each disk file according to the index.

Let's analyze the principles of HashShuffleManager and SortShuffleManager in detail.

4. Operation Principle of HashShuffleManager


4.1 Unoptimized HashShuffleManager

(Figure: the unoptimized HashShuffleManager)

The above diagram illustrates the principle of the unoptimized HashShuffleManager. Here we first clarify an assumption: each Executor has only one CPU core, that is, no matter how many task threads are allocated to this Executor, only one task thread can be executed at the same time.

Let's start with shuffle write. The shuffle write phase takes place after a stage finishes its computation, and prepares data for the shuffle-type operators (such as reduceByKey) of the next stage by "classifying" the data processed by each task according to key. The so-called "classification" means applying a hash to each key, so that records with the same key are written to the same disk file, and each disk file belongs to only one task of the downstream stage. Before the data is written to disk, it is first written to a memory buffer, and the buffer is spilled to the disk file only after it fills up.
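
As a rough illustration, the "classification" step can be thought of as computing a partition index from the key's hash, along the lines of the following sketch (a simplified stand-in for Spark's HashPartitioner, not the actual source code):

    // Simplified sketch of hash-based "classification": map a record's key
    // to the downstream task (partition) that will consume it.
    def choosePartition(key: Any, numReduceTasks: Int): Int = {
      val h = if (key == null) 0 else key.hashCode
      val mod = h % numReduceTasks
      if (mod < 0) mod + numReduceTasks else mod     // keep the index non-negative
    }

    // Records with the same key always land in the same partition, and therefore
    // in the same per-partition disk file, e.g. choosePartition("user_42", 100).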

So how many disk files does each task that performs shuffle write create for the next stage? The answer is simple: each task in the current stage creates one disk file for every task in the next stage. For example, if the next stage has 100 tasks in total, then each task of the current stage must create 100 disk files. If the current stage has 50 tasks running on 10 Executors, with each Executor executing 5 tasks, then 500 disk files are created on each Executor and 5000 disk files across all Executors. It is clear that the number of disk files produced by the unoptimized shuffle write is staggering.

Now let's talk about shuffle read. Shuffle read is usually what happens at the beginning of a stage. At this point, each task of that stage needs to pull, over the network, all records with the same keys from the computation results of the previous stage on every node, and then perform aggregation or join operations by key. Since during shuffle write each task created one disk file for every task of the downstream stage, during shuffle read each task only needs to pull its own disk file from the nodes where all the tasks of the upstream stage are located.

The pulling process of shuffle read aggregates while it pulls. Each shuffle read task has its own buffer, and each time it can only pull an amount of data no larger than that buffer; it then performs aggregation and other operations through an in-memory Map. After one batch of data has been aggregated, the next batch is pulled into the buffer and aggregated, and so on, until all the data has been pulled and the final result is obtained.
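
The loop below is a minimal sketch of this "aggregate while pulling" idea, assuming a hypothetical fetchNextBatch() that returns one buffer's worth of key/value pairs (it is not Spark's actual reader code):

    import scala.collection.mutable

    // Minimal sketch: pull one buffer-sized batch at a time and fold it
    // into an in-memory map, as a shuffle read task conceptually does.
    def aggregateWhilePulling(
        fetchNextBatch: () => Option[Seq[(String, Long)]],  // hypothetical fetch function
        combine: (Long, Long) => Long): mutable.Map[String, Long] = {
      val aggregated = mutable.Map.empty[String, Long]
      var batch = fetchNextBatch()
      while (batch.isDefined) {
        for ((k, v) <- batch.get) {
          // Merge each record into the running aggregate for its key.
          aggregated(k) = aggregated.get(k).map(combine(_, v)).getOrElse(v)
        }
        batch = fetchNextBatch()   // pull the next batch only after this one is aggregated
      }
      aggregated
    }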

4.2 Optimized HashShuffleManager

(Figure: the optimized HashShuffleManager)

The above figure illustrates the principle of the optimized HashShuffleManager. The optimization here refers to a parameter we can set, spark.shuffle.consolidateFiles. Its default value is false; setting it to true enables the optimization mechanism. Generally speaking, if we use HashShuffleManager, it is recommended to enable this option.

After the consolidate mechanism is enabled, during shuffle write a task no longer creates one disk file for every task of the downstream stage. Instead, the concept of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files, and the number of disk files in the group equals the number of tasks in the downstream stage. An Executor can run as many tasks in parallel as it has CPU cores. Each task in the first batch executed in parallel creates a shuffleFileGroup and writes its data into the corresponding disk files.

When the Executor's CPU cores finish one batch of tasks and move on to the next batch, the new tasks reuse the existing shuffleFileGroups, including the disk files in them. That is, a task then writes its data into existing disk files rather than creating new ones. The consolidate mechanism therefore allows different tasks to reuse the same batch of disk files, which effectively merges the disk files of multiple tasks to a certain extent, greatly reducing the number of disk files and thereby improving shuffle write performance.

Suppose the second stage has 100 tasks and the first stage has 50 tasks, with the same 10 Executors each executing 5 tasks. With the original unoptimized HashShuffleManager, each Executor would generate 500 disk files and all Executors would generate 5000. After this optimization, however, the number of disk files created by each Executor is: number of CPU cores * number of tasks in the next stage. That is, each Executor now creates only 100 disk files, and all Executors create only 1000.
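
The file-count formulas for both cases can be summarized in a small calculation (using the article's example numbers; coresPerExecutor = 1 matches the earlier single-core assumption):

    // Disk-file counts for the article's example: 10 Executors, 5 tasks each,
    // 1 CPU core per Executor, and 100 tasks in the next (reduce) stage.
    val numExecutors = 10
    val tasksPerExecutor = 5
    val coresPerExecutor = 1
    val nextStageTasks = 100

    // Unoptimized HashShuffleManager: every map task writes one file per reduce task.
    val unoptimizedPerExecutor = tasksPerExecutor * nextStageTasks      // 500
    val unoptimizedTotal = numExecutors * unoptimizedPerExecutor        // 5000

    // With consolidateFiles: files are shared per core, not per task.
    val consolidatedPerExecutor = coresPerExecutor * nextStageTasks     // 100
    val consolidatedTotal = numExecutors * consolidatedPerExecutor      // 1000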

5. Operation Principle of SortShuffleManager

The operation mechanism of SortShuffleManager is mainly divided into two kinds: the normal operation mechanism and the bypass operation mechanism. When the number of shuffle read tasks is less than or equal to the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default), the bypass mechanism is enabled.

5.1 Normal Operation Mechanism

(Figure: SortShuffleManager, normal operation mechanism)

The above figure illustrates the principle of the normal SortShuffleManager. In this mode, data is first written into an in-memory data structure; which structure is used depends on the shuffle operator. For aggregation-type shuffle operators such as reduceByKey, a Map data structure is used, and aggregation through the Map happens while data is written into memory; for ordinary shuffle operators such as join, an Array data structure is used and records are written directly into memory. Then, every time a record has been written into the in-memory data structure, Spark checks whether a certain threshold has been reached. If it has, it attempts to spill the data in the in-memory data structure to disk and then clears the structure.
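
A minimal sketch of this "insert, check threshold, spill" loop is shown below (the threshold check and spill routine are simplified placeholders, not Spark's actual ExternalSorter logic):

    import scala.collection.mutable

    // Simplified sketch of the normal SortShuffleManager write path:
    // aggregate into an in-memory map and spill when a size threshold is hit.
    class InMemoryAggregator(spillThreshold: Int, combine: (Long, Long) => Long) {
      private val buffer = mutable.Map.empty[String, Long]

      def insert(key: String, value: Long): Unit = {
        buffer(key) = buffer.get(key).map(combine(_, value)).getOrElse(value)
        if (buffer.size >= spillThreshold) {   // placeholder for Spark's memory-size estimate
          spillToDisk()
        }
      }

      private def spillToDisk(): Unit = {
        // Sort by key before spilling, then clear the in-memory structure.
        val sorted = buffer.toSeq.sortBy(_._1)
        // ... write `sorted` to a temporary disk file here ...
        buffer.clear()
      }
    }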

Before the data is spilled to a disk file, the data already in the in-memory data structure is sorted by key. After sorting, the data is written to the disk file in batches. The default batch size is 10,000 records, that is, the sorted data is written to the disk file in batches of 10,000 records each. Writing to disk files is implemented through Java's BufferedOutputStream, a buffered output stream that first buffers data in memory and writes it to the disk file only when the memory buffer is full, which reduces the number of disk IOs and improves performance.
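
For illustration, a batched write through BufferedOutputStream could look roughly like this (the file path, record format and per-batch flush are made up for the example; Spark itself writes serialized binary records):

    import java.io.{BufferedOutputStream, FileOutputStream}
    import java.nio.charset.StandardCharsets

    // Sketch: write sorted records to a spill file in batches of 10,000,
    // going through a BufferedOutputStream so small writes are coalesced.
    def writeSortedBatches(sorted: Seq[(String, Long)], path: String): Unit = {
      val out = new BufferedOutputStream(new FileOutputStream(path), 32 * 1024) // 32k buffer
      try {
        sorted.grouped(10000).foreach { batch =>
          batch.foreach { case (k, v) =>
            out.write(s"$k\t$v\n".getBytes(StandardCharsets.UTF_8))
          }
          out.flush()   // one flush per 10,000-record batch in this sketch
        }
      } finally {
        out.close()
      }
    }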

Because a task spills to disk multiple times while writing all of its data into the in-memory data structure, several temporary files are produced. Finally, all the previous temporary disk files are merged; this is the merge process, in which the data from all previous temporary disk files is read and written in turn into the final disk file. In addition, since a task corresponds to only one disk file, which means all the data that the task prepares for the tasks of the downstream stage lives in that one file, a separate index file is also written; it records the start offset and end offset, within the data file, of the data meant for each downstream task.
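
Conceptually, the index file lets each downstream task find its byte range in the merged data file; a toy version of that lookup might look like this (the plain offsets array is an illustration, not Spark's on-disk index format):

    // Toy model of the per-task index: offsets(i) .. offsets(i + 1) is the byte
    // range in the merged data file that belongs to downstream task i.
    final case class ShuffleIndex(offsets: Array[Long]) {
      def rangeFor(reduceTaskId: Int): (Long, Long) =
        (offsets(reduceTaskId), offsets(reduceTaskId + 1))
    }

    // Example: 3 downstream tasks whose data occupies bytes [0,120), [120,300), [300,450).
    val index = ShuffleIndex(Array(0L, 120L, 300L, 450L))
    val (start, end) = index.rangeFor(1)   // task 1 reads bytes 120 until 300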

Because of this disk-file merging step, SortShuffleManager greatly reduces the number of files. For example, if the first stage has 50 tasks spread over 10 Executors with 5 tasks each, and the second stage has 100 tasks, then since each task ultimately has only one disk file, there are only 5 disk files on each Executor and only 50 disk files across all Executors.

5.2 Bypass Operation Mechanism

(Figure: SortShuffleManager, bypass operation mechanism)

The above figure illustrates the principle of the bypass SortShuffleManager. The bypass mechanism is triggered when both of the following conditions hold:

  • The number of shuffle read tasks is less than the value of the spark.shuffle.sort.bypassMergeThreshold parameter.
  • The shuffle operator is not an aggregation-type operator (such as reduceByKey).

 

In this case, the task creates a temporary disk file for each downstream task, hashes each record by key, and writes the record to the disk file corresponding to its key's hash value. As before, records are first written into a memory buffer and spilled to the disk file when the buffer fills up. Finally, all the temporary disk files are merged into a single disk file and a single index file is created.

The disk-writing mechanism of this process is in fact exactly the same as that of the unoptimized HashShuffleManager, in that a staggering number of temporary disk files are created; the difference is that they are merged into a single disk file at the end. Because the number of final disk files is small, shuffle read performance is better than with the unoptimized HashShuffleManager.

The differences between this mechanism and the normal SortShuffleManager operating mechanism are: first, the disk-writing mechanism is different; second, no sorting is performed. In other words, the biggest advantage of enabling this mechanism is that the shuffle write process does not need to sort the data, which saves that part of the performance overhead.

6. Shuffle-Related Parameter Tuning

The following are some of the main parameters in the shuffle process. For each parameter, its function, default value, and tuning suggestions based on practical experience are explained in detail.

The default values of these parameters may differ between Spark versions. For details, please refer to the documentation on the official website:

(1) First select the corresponding Spark version: http://spark.apache.org/documentation.html

(2) Then open the configuration documentation for that version

spark.shuffle.file.buffer

  • Default: 32k
  • Parameter description: This parameter sets the buffer size of the shuffle write task's BufferedOutputStream. Data is written into this buffer first and is spilled to the disk file only after the buffer fills up.
  • Tuning suggestion: If the job has sufficient memory available, you can appropriately increase this parameter (for example, to 64k), thereby reducing the number of times the buffer is spilled to disk during shuffle write, which reduces disk IO and improves performance. In practice, adjusting this parameter reasonably yields a performance improvement of roughly 1% to 5%.

spark.reducer.maxSizeInFlight

  • Default: 48m
  • Parameter description: This parameter sets the buffer size of the shuffle read task, and this buffer determines how much data can be pulled at a time.
  • Tuning suggestion: If the job has sufficient memory available, you can appropriately increase this parameter (for example, to 96m), thereby reducing the number of pulls and hence of network transfers, which improves performance. In practice, adjusting this parameter reasonably yields a performance improvement of roughly 1% to 5%.

spark.shuffle.io.maxRetries

  • Default: 3
  • Parameter description: When a shuffle read task pulls its own data from the node where the shuffle write task ran, the pull is automatically retried if it fails due to a network problem. This parameter sets the maximum number of retries. If the pull still has not succeeded within that number of retries, the job may fail.
  • Tuning suggestion: For jobs that include particularly time-consuming shuffle operations, it is recommended to increase the maximum number of retries (for example, to 60) to avoid data pull failures caused by factors such as JVM full GC or network instability. In practice, for shuffles over very large amounts of data (billions to tens of billions of records), adjusting this parameter can greatly improve stability.

spark.shuffle.io.retryWait

  • Default: 5s
  • Parameter description: The specific explanation is the same as above. This parameter represents the waiting interval for each retry to pull data, and the default is 5s.
  • Tuning suggestion: It is recommended to increase the interval time (such as 60s) to increase the stability of the shuffle operation.
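
As an example, these shuffle IO parameters can be set on a SparkConf or passed via spark-submit --conf; the values below are just the illustrative ones mentioned above, not universal recommendations:

    import org.apache.spark.SparkConf

    // Illustrative shuffle IO settings (values taken from the suggestions above).
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-example")
      .set("spark.shuffle.file.buffer", "64k")        // shuffle write buffer
      .set("spark.reducer.maxSizeInFlight", "96m")    // shuffle read buffer
      .set("spark.shuffle.io.maxRetries", "60")       // retries on fetch failure
      .set("spark.shuffle.io.retryWait", "60s")       // wait between retries

    // Equivalent spark-submit form:
    //   spark-submit --conf spark.shuffle.file.buffer=64k \
    //                --conf spark.reducer.maxSizeInFlight=96m \
    //                --conf spark.shuffle.io.maxRetries=60 \
    //                --conf spark.shuffle.io.retryWait=60s ...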

spark.shuffle.memoryFraction (deprecated)

  • Default: 0.2
  • Parameter description: This parameter represents the proportion of memory allocated to the shuffle read task for aggregation operations in the Executor memory. The default is 20%.
  • Tuning suggestion: This parameter is explained in Resource Parameter Tuning. If the memory is sufficient and persistent operations are rarely used, it is recommended to increase this ratio and give more memory to the shuffle read aggregation operation to avoid frequent reading and writing of disks during the aggregation process due to insufficient memory. In practice, it is found that reasonable adjustment of this parameter can improve the performance by about 10%.

spark.shuffle.manager (deprecated)

  • Default: sort
  • Parameter description: This parameter sets the type of ShuffleManager. Since Spark 1.5 there are three options: hash, sort and tungsten-sort. HashShuffleManager was the default before Spark 1.2; Spark 1.2 and later versions default to SortShuffleManager. tungsten-sort is similar to sort but uses the off-heap memory management mechanism of the Tungsten project, making memory use more efficient.
  • Tuning suggestion: SortShuffleManager sorts data by default, so if your business logic needs that sorting, you can keep the default SortShuffleManager. If your business logic does not need the data sorted, it is recommended to use the parameters discussed below to avoid the sorting operation through the bypass mechanism or the optimized HashShuffleManager, while still getting good disk read and write performance. Note that tungsten-sort should be used with caution, as some bugs have previously been found in it.

spark.shuffle.sort.bypassMergeThreshold

  • Default: 200
  • Parameter description: When the ShuffleManager is SortShuffleManager, if the number of shuffle read tasks is less than this threshold (200 by default), the shuffle write process does not perform a sorting operation; instead the data is written in the same way as the unoptimized HashShuffleManager, except that in the end all the temporary disk files generated by each task are merged into one file and a separate index file is created.
  • Tuning suggestion: When you use SortShuffleManager and do not need the sorting operation, it is recommended to set this parameter to a value larger than the number of shuffle read tasks. The bypass mechanism is then enabled automatically and the map side does not sort, which saves the sorting overhead. However, a large number of temporary disk files are still generated in this mode, so shuffle write performance still leaves room for improvement.
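
For instance, if a job's shuffles have at most 500 reduce-side partitions (a made-up number for illustration) and need no map-side sorting, the threshold could be raised above that count:

    import org.apache.spark.SparkConf

    // Raise the bypass threshold above the largest expected number of shuffle
    // read tasks (assumed here to be at most 500) so the bypass path is taken.
    val conf = new SparkConf()
      .set("spark.shuffle.sort.bypassMergeThreshold", "1000")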

spark.shuffle.consolidateFiles (deprecated)

  • Default: false
  • Parameter description: This parameter is valid if HashShuffleManager is used. If set to true, the consolidate mechanism will be turned on, and the output files of shuffle write will be greatly merged. In the case of a particularly large number of shuffle read tasks, this method can greatly reduce disk IO overhead and improve performance.
  • Tuning suggestion: If you truly do not need SortShuffleManager's sorting mechanism, then besides using the bypass mechanism you can also try manually setting the spark.shuffle.manager parameter to hash, using HashShuffleManager with the consolidate mechanism enabled. In practice I have found its performance to be 10%~30% higher than that of SortShuffleManager with the bypass mechanism enabled.

 
