Spark shuffle parameters and tuning suggestions (repost)

Original: http://www.cnblogs.com/arachis/p/Spark_Shuffle.html

spark.shuffle.file.buffer

  • Default: 32k
  • Parameter description: This parameter sets the buffer size of the BufferedOutputStream used by each shuffle write task. Data is written to this in-memory buffer first, and is spilled to the disk file only once the buffer fills up.
  • Tuning suggestion: If the job has sufficient memory available, you can increase this parameter appropriately (for example, to 64k), reducing the number of times the buffer is flushed to disk during shuffle write, which in turn reduces disk IO and can improve performance. In practice, tuning this parameter sensibly has been found to improve performance by 1% to 5%. A configuration sketch follows.
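
A minimal sketch of how this could be set when building a SparkConf; the value string uses Spark's size-suffix format, and the flush-count arithmetic in the comment is illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Each shuffle write task flushes its BufferedOutputStream to disk only
  // when the buffer fills, so doubling 32k -> 64k roughly halves the number
  // of flushes for the same volume of map output.
  .set("spark.shuffle.file.buffer", "64k")
```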

spark.reducer.maxSizeInFlight

  • Default: 48m
  • Parameter description: This parameter sets the buffer size used by each shuffle read task, which determines how much data can be pulled at a time.
  • Tuning suggestion: If the job has sufficient memory available, you can increase this parameter appropriately (for example, to 96m), reducing the number of pull rounds and therefore the number of network transfers, which can improve performance. In practice, tuning this parameter sensibly has been found to improve performance by 1% to 5%. A configuration sketch follows.
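
A minimal sketch, with back-of-the-envelope arithmetic in the comment (the 4.8g fetch size is an invented example):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")
// Rough intuition: a reduce task fetching ~4.8g of map output needs on the
// order of 100 pull rounds at 48m per round, but only ~50 rounds at 96m.
```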

spark.shuffle.io.maxRetries

  • Default: 3
  • Parameter description: When a shuffle read task pulls its data from the node where the corresponding shuffle write task ran, a pull that fails due to a network fault is retried automatically. This parameter sets the maximum number of retries; if the pull still fails after that many attempts, the job may fail.
  • Tuning suggestion: For jobs that contain particularly time-consuming shuffle operations, it is recommended to increase the maximum number of retries (for example, to 60) to avoid pull failures caused by JVM full GC pauses, network instability, and similar factors. In practice, adjusting this parameter has been found to greatly improve shuffle stability for very large datasets (billions to tens of billions of records). A sketch of the retry semantics follows.
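
A simplified model of the retry behavior, written around a hypothetical fetch function; this illustrates the semantics of maxRetries and retryWait, and is not Spark's actual implementation:

```scala
import java.io.IOException

// Retry a failed block fetch up to maxRetries times, sleeping retryWaitMs
// between attempts. Once the attempts are exhausted, the exception
// propagates and the stage (and possibly the job) fails.
def fetchWithRetries[T](maxRetries: Int, retryWaitMs: Long, attempt: Int = 0)(fetch: () => T): T =
  try fetch()
  catch {
    case e: IOException if attempt < maxRetries =>
      Thread.sleep(retryWaitMs) // spark.shuffle.io.retryWait
      fetchWithRetries(maxRetries, retryWaitMs, attempt + 1)(fetch)
  }
```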

spark.shuffle.io.retryWait

  • Default: 5s
  • Parameter description: See the previous parameter. This parameter sets the wait interval between successive retries of a data pull; the default is 5s.
  • Tuning suggestion: It is recommended to increase the interval (for example, to 60s) to improve the stability of shuffle operations. A configuration sketch follows.
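
A minimal sketch setting both retry parameters together; the worst-case arithmetic in the comment follows directly from the two values:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "60")
  .set("spark.shuffle.io.retryWait", "60s")
// Worst case, a single failing fetch is now retried for up to
// 60 retries x 60s = 1 hour before the task finally gives up.
```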

spark.shuffle.memoryFraction

  • Default: 0.2
  • Parameter description: This parameter sets the fraction of executor memory allocated to shuffle read tasks for aggregation operations; the default is 20%.
  • Tuning suggestion: This parameter is also covered under resource parameter tuning. If memory is plentiful and persistence operations are rarely used, it is recommended to increase this fraction so that shuffle read aggregation gets more memory, avoiding the frequent disk reads and writes that occur when memory runs short during aggregation. In practice, adjusting this parameter sensibly has been found to improve performance by about 10%. A worked sketch follows.
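
A minimal sketch; the 10g executor size is an invented example, and note that this fraction belongs to the legacy static memory manager of the Spark 1.5 era (Spark 1.6+ replaced these fractions with unified memory management):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "10g")
  // Raise the shuffle aggregation share from the default 0.2 to 0.3:
  // roughly 10g x 0.3 = 3g per executor for shuffle read aggregation,
  // instead of 10g x 0.2 = 2g.
  .set("spark.shuffle.memoryFraction", "0.3")
```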

spark.shuffle.manager

  • Default: sort
  • Parameter description: This parameter sets the type of ShuffleManager. As of Spark 1.5 there are three options: hash, sort, and tungsten-sort. HashShuffleManager was the default before Spark 1.2; Spark 1.2 and later default to SortShuffleManager. tungsten-sort is similar to sort, but uses the off-heap memory management mechanism of the Tungsten project, which makes memory usage more efficient.
  • Tuning suggestion: Since SortShuffleManager sorts data by default, keep the default SortShuffleManager if your business logic needs that sorting. If your business logic does not need sorted data, it is recommended to tune the parameters below so that sorting is avoided, either through the bypass mechanism or through the optimized HashShuffleManager, which also offers better disk read/write performance. Note that tungsten-sort should be used with caution, since bugs have been found in it in the past. (The key question: does the job need sorting? See the sketch below.)
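
A minimal sketch of selecting the manager explicitly (a Spark 1.x-era setting; the hash and tungsten-sort options were removed in Spark 2.0):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort") // default; keeps map-side sorting
// Alternatives when the job does not need sorted data:
//   .set("spark.shuffle.manager", "hash")          // optimized HashShuffleManager
//   .set("spark.shuffle.manager", "tungsten-sort") // off-heap; use with caution
```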

spark.shuffle.sort.bypassMergeThreshold

  • Default: 200
  • Parameter description: When the ShuffleManager is SortShuffleManager, if the number of shuffle read tasks is below this threshold (default 200), no sorting is performed during shuffle write; instead the data is written out in the style of the unoptimized HashShuffleManager, except that at the end all the temporary disk files produced by each task are merged into a single file, and a separate index file is created.
  • Tuning suggestion: When using SortShuffleManager, if you do not need sorting, it is recommended to raise this parameter above the number of shuffle read tasks. The bypass mechanism is then enabled automatically and the map side skips sorting, saving that overhead. Note, however, that a large number of temporary disk files are still produced during the write, so shuffle write performance still leaves room for improvement. (Raise the parameter to trigger the bypass path and skip sorting; see the sketch below.)
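
A minimal sketch; the 500-reduce-task job is an invented example, and the no-map-side-aggregation condition is part of how the bypass writer is chosen:

```scala
import org.apache.spark.SparkConf

// Suppose the job has ~500 reduce-side (shuffle read) tasks. Raising the
// threshold above that count lets the bypass path kick in.
val conf = new SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "1000")
// Note: the bypass writer is only used when the shuffle has no map-side
// aggregation (e.g. groupByKey qualifies, reduceByKey does not).
```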

spark.shuffle.consolidateFiles

  • Default: false
  • Parameter description: This parameter takes effect only when HashShuffleManager is used. If set to true, the consolidate mechanism is enabled and the output files of shuffle write are heavily merged. When the number of shuffle read tasks is particularly large, this can greatly reduce disk IO overhead and improve performance.
  • Tuning suggestion: If you genuinely do not need SortShuffleManager's sorting mechanism, then besides the bypass mechanism you can also try manually setting the spark.shuffle.manager parameter to hash, using HashShuffleManager with the consolidate mechanism enabled at the same time. In practice this has been found to perform 10% to 30% better than SortShuffleManager with the bypass mechanism enabled. (No sorting needed: use HashShuffleManager + consolidateFiles; see the sketch below.)
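
A minimal sketch of that combination (valid on Spark 1.x only; HashShuffleManager was removed in Spark 2.0):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "true")
// With consolidation, map tasks that run one after another on the same
// executor core append to a shared group of output files (one per reduce
// task) instead of each map task creating its own full set of files.
```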
