Adaptive Execution makes Spark SQL more efficient and smarter

This article is reposted from the technology world; the original link is http://www.jasongj.com/spark/adaptive_execution/

1 Background

The optimizations introduced in the previous articles " Spark SQL / Catalyst Internal Principles and RBO " and " Spark SQL Performance Optimization Further CBO Cost-Based Optimization " work from the characteristics of the query itself and of the target data to make the final generated execution plan as efficient as possible. However:

  • Once the execution plan is generated, it cannot be changed; even if it turns out during execution that the plan could be further optimized, execution proceeds according to the original plan
  • CBO generates an optimal execution plan based on statistical information, which must be generated in advance; this is costly and unsuitable for scenarios with frequent data updates
  • CBO estimates intermediate results from the statistics of the base tables and the effect of each operation on the data; these are only estimates and are not accurate

The Adaptive Execution introduced in this article optimizes subsequent execution based on the intermediate data produced during execution, thereby improving overall execution efficiency. Its core lies in two points:

  • The execution plan can be dynamically adjusted
  • The adjustments are based on accurate statistics of the intermediate results

2 Dynamically set Shuffle Partition

2.1 Principle of Spark Shuffle

Spark Shuffle is generally used to repartition the data of the upstream Stage by Key, ensuring that records with the same Key from different Mappers (the tasks of the upstream Stage) end up in the same Reducer (a task of the downstream Stage). It is typically used for group by and Join operations.
Spark Shuffle process

As shown in the figure above, this Shuffle has 2 Mappers and 5 Reducers. Each Mapper divides its data into five parts according to the same rule (defined by the Partitioner), and each Reducer pulls its own part of the data from both Mappers.

2.2 Problems with the original Shuffle

When using Spark SQL, the number of Shuffle Partitions, i.e. the number of Reducers, is specified by spark.sql.shuffle.partitions

This parameter determines the number of Partitions for all Shuffles in a Spark SQL Job. As shown in the figure below, when the parameter value is 3, the number of Reducers in every Shuffle is 3
Spark SQL with multiple Shuffle

This method has the following problems

  • The number of Partitions should not be set too large
    • Too many Reducers (the tasks that perform Shuffle Read in the Spark Shuffle process) means each Reducer processes too little data; a large number of small tasks causes unnecessary task scheduling overhead and, if Dynamic Allocation is enabled, possible resource scheduling overhead
    • If the many Reducers write directly to HDFS, a large number of small files is generated, causing a large number of addBlock RPCs; the NameNode may become a bottleneck and affect other applications using HDFS
    • The small files written by the many Reducers also cause a large number of getBlock RPCs when they are later read, which again puts pressure on the NameNode
  • The number of Partitions should not be set too small
    • Each Reducer processes too much data, and the overhead of spilling to disk increases
    • Reducer GC time increases
    • If the Reducers write to HDFS, each writes a large amount of data, and parallelism cannot be fully exploited
  • It is difficult to guarantee that all Shuffles are optimal
    • Different Shuffles involve different amounts of data, so their optimal Partition numbers also differ; a uniform Partition number can hardly be optimal for all Shuffles
    • Scheduled jobs process different data volumes in different time periods, and the same Partition number setting cannot be optimal for all of them

2.3 Principle of Automatically Setting Shuffle Partition

As shown in the figure in the Spark Shuffle principle section, the data volumes of the 5 Partitions of Stage 1 are 60MB, 40MB, 1MB, 2MB, and 50MB respectively. The 1MB and 2MB Partitions are obviously too small (in real scenarios, some small Partitions are only tens of KB or even tens of bytes)

After enabling Adaptive Execution:

  • After the Shuffle Write of Stage 0 ends, the data volume of each Partition is computed from the output of each Mapper, namely 60MB, 40MB, 1MB, 2MB, 50MB
  • The ExchangeCoordinator computes an appropriate number of post-shuffle Partitions (i.e. Reducers); in this example the number of Reducers is set to 3
  • The corresponding number of Reducer tasks is started
  • Each Reducer reads the data of one or more Shuffle Write Partitions (as shown in the figure below, Reducer 0 reads Partition 0, Reducer 1 reads Partitions 1, 2 and 3, and Reducer 2 reads Partition 4)
    Spark SQL adaptive reducer 1

The three Reducers are allocated this way because

  • targetPostShuffleInputSize defaults to 64MB, and the amount of data read by each Reducer must not exceed 64MB
  • Combining Partition 0 with Partition 2, and Partition 1 with Partition 3, would also stay under 64MB. But reading Partition 2 right after Partition 0 from the same Mapper amounts to a random read when each Partition holds little data, and random reads perform poorly on HDDs
  • The current approach therefore only combines adjacent Partitions, guaranteeing sequential reads and better disk IO performance
  • This solution only merges multiple small Partitions and never splits a large Partition, because splitting would require introducing a new round of Shuffle
  • For these reasons, the default number of Partitions (5 in this example) can be set on the large side and then merged by the ExchangeCoordinator; if it is set too small, Adaptive Execution cannot help in this scenario
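The adjacent-merge rule described above can be sketched in a few lines of Python. This is a toy illustration of the greedy idea, not Spark's actual ExchangeCoordinator code:

```python
# Greedy merge of *adjacent* post-shuffle partitions: a group is closed as
# soon as adding the next partition would push it past the target size.
def coalesce_partitions(sizes_mb, target_mb=64):
    groups, current = [], []
    for i, size in enumerate(sizes_mb):
        if current and sum(sizes_mb[j] for j in current) + size > target_mb:
            groups.append(current)   # close the current Reducer's group
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return groups

# The example from the text: 60MB, 40MB, 1MB, 2MB, 50MB with a 64MB target
print(coalesce_partitions([60, 40, 1, 2, 50]))  # → [[0], [1, 2, 3], [4]]
```

The result matches the figure: three Reducers, with Reducer 1 merging Partitions 1, 2 and 3.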

As can be seen from the figure above, Reducer 1 reads Partitions 1, 2 and 3 from each Mapper, shown as three lines. This is because in the original Shuffle design, each Fetch request by a Reducer can only read the data of one Partition from a specific Mapper. That is, for Reducer 1 to read the data of Mapper 0 in the figure above, 3 rounds of Fetch requests are required, and the Mapper needs to read the disk three times, which is equivalent to random IO.

To solve this problem, Spark added a new interface through which one Shuffle Read can fetch the data of multiple Partitions. As shown in the figure below, Task 1 can read the data of Partitions 0, 1 and 2 of Task 0 in a single round of requests, reducing the number of network requests. At the same time, Mapper 0 reads and returns the data of the three Partitions in one go, which is equivalent to sequential IO, thereby improving performance.
Spark SQL adaptive reducer 2

Since Adaptive Execution sets the number of Reducers automatically via the ExchangeCoordinator based on Shuffle Write statistics, the number of Reducers can differ between Shuffles even within the same Job, so that each Shuffle is as close to optimal as possible.

In the earlier example of problems with the original Shuffle, after Adaptive Execution is enabled, the Reducer counts of the three Shuffles change from a uniform 3 to 2, 4, and 3 respectively.

Spark SQL with adaptive Shuffle

2.4 Use and optimization methods

Setting spark.sql.adaptive.enabled=true enables Adaptive Execution and thereby this feature of automatically setting the number of Shuffle Reducers

spark.sql.adaptive.shuffle.targetPostShuffleInputSize sets the target amount of data read by each Reducer, in bytes; the default value is 64 MB. In the above example, if this value were set to 50 MB, the final result would still be as shown above, and the 60MB Partition 0 would not be split; the reason has been explained above
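For reference, the two settings above could be placed together in spark-defaults.conf like this (assuming a Spark build that carries the Adaptive Execution feature described here; the value shown is simply the 64 MB default in bytes):

```properties
# Enable Adaptive Execution, including automatic Reducer-count selection
spark.sql.adaptive.enabled                               true
# Target bytes read per Reducer after the shuffle (64 MB default)
spark.sql.adaptive.shuffle.targetPostShuffleInputSize    67108864
```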

3 Dynamically adjust the execution plan

3.1 Insufficiency of a fixed execution plan

Before Adaptive Execution is enabled, once the execution plan is determined it cannot be changed, even if it later turns out that it could be optimized. As shown in the figure below, after the Shuffle Write of a SortMergeJoin ends, it is found that the Shuffle output of one side of the Join is only 46.9KB, yet SortMergeJoin is still executed
Spark SQL with fixed DAG

At this time, SortMergeJoin can be changed to BroadcastJoin to improve overall execution efficiency.

3.2 SortMergeJoin principle

SortMergeJoin is a commonly used distributed Join method, which can be used in almost all scenarios that require Join. But in some scenarios, its performance is not the best.

The principle of SortMergeJoin is shown in the figure below

  • Both sides of the Join are partitioned by a HashPartitioner with the Join Key as the key, and with the same number of Partitions
  • During Shuffle Write, all tasks of Stage 0 and Stage 1 divide their data into 5 Partitions, each sorted by the Join Key
  • Stage 2 starts 5 tasks, each of which fetches its Partition's data from every task of Stage 0 and Stage 1 that holds data for that Partition (if a Mapper holds no data for the Partition, the Reducer does not need to issue a read request to it)
  • Task 2 of Stage 2 reads the data of Partition 2 from Tasks 0, 1 and 2 of Stage 0 and sorts it with MergeSort
  • Task 2 of Stage 2 reads the data of Partition 2 from Tasks 0 and 1 of Stage 1 and sorts it with MergeSort
  • Task 2 of Stage 2 performs SortMergeJoin on the two sorted streams while the above two MergeSorts are in progress

Spark SQL SortMergeJoin
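The merge step above can be sketched for a single partition as follows. This is a simplified toy inner join over hypothetical data, not Spark's implementation; it only shows why sorted inputs allow a one-pass merge:

```python
def sort_merge_join(left, right):
    """left/right: lists of (key, value) tuples, each sorted by key. Inner join."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1               # left cursor is behind, advance it
        elif lk > rk:
            j += 1               # right cursor is behind, advance it
        else:
            # Emit all right-side rows sharing this key, then advance left
            j0 = j
            while j0 < len(right) and right[j0][0] == lk:
                out.append((lk, left[i][1], right[j0][1]))
                j0 += 1
            i += 1
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(sort_merge_join(left, right))
# → [(2, 'b', 'x'), (2, 'c', 'x'), (4, 'd', 'z')]
```

Because both inputs are sorted, neither side ever needs to be held fully in memory, which is why SortMergeJoin works at almost any scale.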

3.3 BroadcastJoin principle

When one side of the Join is small enough to fit entirely in Executor memory, the Broadcast mechanism can be used to broadcast that whole RDD to every Executor, and all Tasks running on an Executor can read its data directly. (In the subsequent figures of this article, for ease of presentation, the entire RDD's data is drawn inside the Task box and the Executor is hidden.)

The large RDD is processed in the normal way: each Task reads and processes the data of one Partition, while also reading the broadcast data in its Executor. Since the broadcast data contains the full data of the small RDD, each Task can directly join it with its own part of the large RDD
Spark SQL BroadcastJoin
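The idea can be sketched as a toy BroadcastHashJoin (hypothetical data, not Spark code): the small side becomes an in-memory hash map, the shared "broadcast" copy, and each partition of the large side probes it locally with no shuffle:

```python
def broadcast_hash_join(large_partition, small_side):
    # Build the hash map once from the small side; in Spark this is the
    # broadcast copy shared by every Task on an Executor.
    broadcast = {}
    for key, value in small_side:
        broadcast.setdefault(key, []).append(value)
    # Probe with the large side: no shuffle, no sort of the large data.
    return [(k, lv, sv)
            for k, lv in large_partition
            for sv in broadcast.get(k, [])]

small = [(1, "dim_a"), (2, "dim_b")]
large = [(2, "fact_1"), (3, "fact_2"), (1, "fact_3")]
print(broadcast_hash_join(large, small))
# → [(2, 'fact_1', 'dim_b'), (1, 'fact_3', 'dim_a')]
```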

According to how the Join is implemented within the Task, this divides into BroadcastHashJoin and BroadcastNestedLoopJoin. The following text does not distinguish between the two implementations and refers to both as BroadcastJoin

Compared with SortMergeJoin, BroadcastJoin does not require Shuffle, which reduces the overhead caused by Shuffle, and at the same time avoids data skew caused by Shuffle, thereby greatly improving job execution efficiency

At the same time, BroadcastJoin brings the overhead of broadcasting a small RDD. In addition, if the small RDD is too large to be stored in the Executor memory, BroadcastJoin cannot be used

For a Join of base tables, the size of each table can be obtained directly from HDFS before the execution plan is generated, to decide whether BroadcastJoin is suitable. But for a Join of intermediate tables, their sizes cannot be accurately determined in advance, so whether BroadcastJoin is suitable cannot be accurately decided

" Spark SQL performance optimization and further CBO cost-based optimization " The CBO introduced in the article can infer the statistical information of the intermediate table through the statistical information of the table and the impact of each operation on the data statistical information, but the statistical information obtained by this method is not accurate enough . At the same time, this method requires the analysis of the table in advance, which has a large overhead

With Adaptive Execution enabled, whether BroadcastJoin is applicable can be judged directly from the Shuffle Write data

3.4 Principles of dynamically adjusting the execution plan

As shown in the figure in the SortMergeJoin principle section above, SortMergeJoin requires Stage 0 and Stage 1 to perform Shuffle Write with the same Partitioner.

After the Shuffle Write finishes, the MapStatus of each ShuffleMapTask can be aggregated to obtain the data volume of each Partition of Stage 2, as well as the total data volume Stage 2 would need to read under the original plan. (Strictly speaking, Partition is a property of an RDD rather than of a Stage; for convenience, this article does not distinguish between Stage and RDD. You may simply assume each Stage has only one RDD, in which case Stage and RDD are equivalent within the scope of this article.)

If the output of one of the Stages is small, BroadcastJoin is suitable, and there is no need to execute Stage 2's Shuffle Read; instead, the data of Stage 0 and Stage 1 can be used directly for BroadcastJoin, as shown in the figure below
Spark SQL Auto BroadcastJoin

The specific approach is

  • Broadcast all the Shuffle Write results of Stage 1
  • Start Stage 2 with the same number of Partitions as Stage 0, namely 3
  • Each Task of Stage 2 reads the Shuffle Write data of the corresponding Task of Stage 0 and joins it with the full data of Stage 1 obtained via the broadcast

**Note:** The broadcast data is stored once per Executor and shared by all Tasks on it; there is no need to broadcast a copy per Task. In the figure above, to show more clearly why the join can be done directly, a copy of the full Stage 1 data is drawn inside each Task box of Stage 2

Even though the Shuffle Write has already completed, changing the subsequent SortMergeJoin to a BroadcastJoin can still improve execution efficiency:

  • SortMergeJoin must MergeSort the data from Stage 0 and Stage 1 during Shuffle Read, and may need to spill to disk, which is expensive
  • With SortMergeJoin, every Task of Stage 2 fetches output from every Task of Stage 0 and Stage 1 that holds data it needs, causing a large number of network connections; and when Stage 2 has many Tasks, this also causes a large number of random disk reads, which is inefficient and affects the execution of other jobs on the same machines
  • With SortMergeJoin, each Task of Stage 2 needs to fetch data from almost all Tasks of Stage 0 and Stage 1, so Locality cannot be exploited well
  • With Broadcast instead, each Task of Stage 2 directly reads the data of exactly one Task of Stage 0 (one-to-one), which exploits Locality well. Ideally the Stage 2 Task is started on the very Executor used by Stage 0: if Stage 0's Shuffle Write data was not spilled and is still in memory, the Stage 2 Task can read it directly from memory, which is very efficient; if it was spilled, the data can be read sequentially from local files, which is still far more efficient than random reads over the network

3.5 Use and optimization methods

The use of this feature is as follows

  • Setting both spark.sql.adaptive.enabled and spark.sql.adaptive.join.enabled to true enables Adaptive Execution's dynamic Join adjustment
  • spark.sql.adaptiveBroadcastJoinThreshold sets the threshold for converting a SortMergeJoin into a BroadcastJoin; if this parameter is not set, its value equals spark.sql.autoBroadcastJoinThreshold
  • Besides the SortMergeJoin-to-BroadcastJoin conversion described in this article, Adaptive Execution also provides other Join optimization strategies. Some of them may need to add a Shuffle; the spark.sql.adaptive.allowAdditionalShuffle parameter determines whether adding a Shuffle to optimize a Join is allowed, and its default value is false

4 Automatic processing of data skew

4.1 Typical solutions for data skew

The article " Spark Performance Optimization-Solving the N Postures of Spark Data Skew " describes the hazards, causes, and typical solutions of data skew

  • Ensure that files are splittable, to avoid data skew when reading from HDFS
  • Ensure that Kafka partitions hold balanced data, to avoid data skew caused by reading from Kafka
  • Adjust the parallelism or use a custom Partitioner, to spread out the many different Keys assigned to the same Task
  • Use BroadcastJoin instead of ReduceJoin to eliminate the Shuffle, avoiding the data skew a Shuffle would cause
  • Apply random prefixes or suffixes to the skewed Keys to spread them out, while expanding the small table participating in the Join accordingly, so that the correctness of the Join result is preserved
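The random-prefix technique in the last bullet can be sketched as follows (a toy illustration, not code from the cited article; the fan-out `N` and all data are hypothetical): the skewed side gets a random prefix in `[0, N)`, and the small side is replicated `N` times so every salted key still finds its match.

```python
import random

N = 3  # salt fan-out: one hot key is spread across N tasks (value is illustrative)

def salt_large(rows):
    """Prefix each key of the large, skewed side with a random salt."""
    return [(f"{random.randrange(N)}_{k}", v) for k, v in rows]

def expand_small(rows):
    """Replicate the small side once per salt value so joins still match."""
    return [(f"{i}_{k}", v) for k, v in rows for i in range(N)]

large = salt_large([("hot_key", 1), ("hot_key", 2), ("hot_key", 3)])
small = expand_small([("hot_key", "dim_row")])
# Every salted large-side key now has a matching small-side key
print(sorted(set(k for k, _ in small)))  # → ['0_hot_key', '1_hot_key', '2_hot_key']
```

After the join, stripping the prefix recovers the original keys; correctness holds because each salted key on the large side meets exactly one replica of the small side.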

4.2 Automatically resolve data skew

At present, Adaptive Execution can solve the problem of data skew during a Join. The idea can be understood as handling the skewed Partitions separately (the skew criterion being that a Partition's data is N times the median of all Partitions' Shuffle Write sizes), in a way similar to BroadcastJoin, as shown in the figure below
Spark SQL resolve join skew

In the figure above, the left and right sides are Stage 0 and Stage 1 participating in the Join (strictly speaking it is two RDDs that are joined, but as noted above, this article does not distinguish between RDD and Stage), and in the middle is Stage 2, which produces the Join result.

Partition 0 obviously holds a relatively large amount of data. Here it is assumed that Partition 0 meets the skew criterion, while the other 4 Partitions are not skewed.

Take Task 2, corresponding to Partition 2, as an example: it fetches all the data belonging to Partition 2 from the three Tasks of Stage 0 and sorts it with MergeSort; at the same time it fetches all the data belonging to Partition 2 from the two Tasks of Stage 1 and sorts it with MergeSort; it then performs SortMergeJoin on the two

For Partition 0, multiple Tasks can be started:

  • In the figure above, two Tasks are started to process Partition 0's data, named Task 0-0 and Task 0-1
  • Task 0-0 reads the data belonging to Partition 0 from Task 0 of Stage 0
  • Task 0-1 reads the data belonging to Partition 0 from Task 1 and Task 2 of Stage 0, and MergeSorts it
  • Task 0-0 and Task 0-1 each fetch all the data belonging to Partition 0 from the two Tasks of Stage 1
  • Task 0-0 and Task 0-1 each join their part of Stage 0's Partition 0 data with the full Partition 0 data from Stage 1

In this way, the data of Partition 0, originally processed by one Task, is processed by multiple Tasks, each handling less data, thereby eliminating the skew of Partition 0
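The split of the skewed Partition 0 across sub-tasks can be sketched like this. It is an illustration of the idea only, not Spark's actual assignment code; the sizes and the splitting rule (contiguous mapper ranges of roughly equal volume) are assumptions for the example:

```python
def split_skewed_partition(mapper_sizes_mb, max_splits=5):
    """Divide one skewed partition's map-side outputs into at most
    max_splits contiguous mapper ranges; each sub-task reads one range."""
    target = max(sum(mapper_sizes_mb) / max_splits, 1)
    splits, current, acc = [], [], 0.0
    for mapper_id, size in enumerate(mapper_sizes_mb):
        current.append(mapper_id)
        acc += size
        if acc >= target and len(splits) < max_splits - 1:
            splits.append(current)   # close this sub-task's mapper range
            current, acc = [], 0.0
    if current:
        splits.append(current)
    return splits

# Stage 0 has 3 Mappers; suppose their Partition-0 outputs are 70, 30, 40 MB
print(split_skewed_partition([70, 30, 40], max_splits=2))  # → [[0], [1, 2]]
```

With two splits, sub-task 0-0 reads Mapper 0 and sub-task 0-1 reads Mappers 1 and 2, mirroring the figure; each sub-task still fetches the full Partition 0 data from the other side.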

The handling of Partition 0 is somewhat similar to BroadcastJoin. The difference is that Task 0-0 and Task 0-1 of Stage 2 obtain the full Partition 0 data of Stage 1 through the normal Shuffle Read mechanism, rather than through a broadcast variable as in BroadcastJoin.

4.3 Use and optimization methods

The method to enable and tune this feature is as follows

  • Setting spark.sql.adaptive.skewedJoin.enabled to true enables automatic handling of skewed Joins
  • spark.sql.adaptive.skewedPartitionMaxSplits controls the upper limit on the number of Tasks used to process one skewed Partition; the default value is 5
  • spark.sql.adaptive.skewedPartitionRowCountThreshold sets the lower limit on the row count for a Partition to be regarded as skewed, i.e. a Partition with fewer rows than this value is never treated as skewed; the default value is 10L * 1000 * 1000, i.e. 10 million
  • spark.sql.adaptive.skewedPartitionSizeThreshold sets the lower limit on the size for a Partition to be regarded as skewed, i.e. a Partition smaller than this value is never treated as skewed; the default value is 64 * 1024 * 1024, i.e. 64MB
  • spark.sql.adaptive.skewedPartitionFactor sets the skew factor. A Partition is regarded as skewed if its size exceeds spark.sql.adaptive.skewedPartitionSizeThreshold and also exceeds the median Partition size multiplied by this factor, or if its row count exceeds spark.sql.adaptive.skewedPartitionRowCountThreshold and also exceeds the median row count multiplied by this factor

5 Spark series articles

Origin blog.csdn.net/lp284558195/article/details/107384916