Seven must-know SQL performance optimizations in Spark 3.0



This article is based on the talk "SQL Performance Improvements at a Glance in Apache Spark 3.0" given by Dr. Kazuaki Ishizaki, a senior researcher at IBM Research Tokyo, at Spark Summit North America 2020. For the video, see the third article in today's post; for the slides (PPT), follow the "Past memory big data" account and reply sparksql3 in the background.

The official version of Spark 3.0 was released last month with many updated features; see the earlier "Past memory big data" post "The official version of Apache Spark 3.0.0 is finally released, and the important features are fully analyzed" for an overview. This article introduces the SQL optimizations in Spark 3.0.
The SQL optimizations go in four main directions:

  • Interaction with developers;
  • Dynamic optimization;
  • Catalyst improvements;
  • Infrastructure updates.

As we mentioned in an earlier article, Spark 3.0 resolved a total of 3464 issues! It would be hard to go through so many one by one, so in this talk Dr. Kazuaki Ishizaki walked us through the improvements in SQL.
The improvement in SQL mainly includes seven aspects:

  1. New EXPLAIN format;
  2. Hints for all join types;
  3. Adaptive query execution;
  4. Dynamic partition pruning;
  5. Enhanced pruning and pushdown of nested columns;
  6. Enhanced code generation for aggregation;
  7. Support for new Scala and Java versions.

New EXPLAIN format

If we want to improve query performance, we need to understand how a query is optimized. First, we need to understand the query plan. Suppose we have a query as follows:


SELECT key, Max(val) 
FROM temp 
WHERE key > 0 
GROUP BY key 
HAVING max(val) > 0

Let's take a look at the query plans Spark 2.4 and Spark 3.0 produce for this SQL.
If you use EXPLAIN in Spark 2.4 to view the query plan, the output is far too long!! Every row carries many unnecessary attributes, and it is hard to see at a glance what each step is doing.
Spark 3.0 adds FORMATTED support to EXPLAIN, which displays the detailed information in a very concise format. The output consists of two parts.

The first part is a series of operators; the second part is a list of attributes.

From this output we can see at a glance how Spark SQL handles the query. If you want more detail, such as an operator's output, you can look at the corresponding attributes in the second part.
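
As a minimal sketch, assuming a SparkSession named spark and that the query's temp table is registered as a view, the new format can be requested like this (Spark 3.0 also adds Dataset.explain("formatted") on the DataFrame side):

// Print the concise two-part plan for the query above
spark.sql("""
  EXPLAIN FORMATTED
  SELECT key, max(val)
  FROM temp
  WHERE key > 0
  GROUP BY key
  HAVING max(val) > 0
""").show(truncate = false)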

Hints for all join types


The second SQL optimization is join type hints.

In Spark 2.4 we could only hint broadcast joins; the other join types were not supported. Spark 3.0 supports hints for all join types, and a hint can be given either directly in SQL or through the DSL. This is very useful when the join strategy Spark picks on its own is not the one we want, as the sketch below shows.
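
As a minimal sketch, assuming registered tables t1 and t2 and DataFrames df1 and df2 that share a column key (all names illustrative), the Spark 3.0 hint names are BROADCAST, MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL:

// Hint in SQL: ask for a shuffle hash join with t1 as the build side
val bySql = spark.sql(
  "SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key")

// The same hints are available in the DSL through Dataset.hint
val byDsl = df1.join(df2.hint("SHUFFLE_MERGE"), "key")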

Adaptive query execution


The third optimization is adaptive query execution.

Using statistics gathered at runtime, it optimizes three aspects:

  1. Automatically setting a reasonable number of reducers;
  2. Choosing a better join strategy to improve performance;
  3. Handling skewed data in joins.

None of these optimizations requires manual tuning. On query Q77 of TPC-DS, performance improves by a factor of 8.
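
As a minimal sketch, this is how the feature is switched on (adaptive execution is disabled by default in Spark 3.0):

// Enable adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Coalesce the number of shuffle partitions (reducers) at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Split skewed partitions in sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")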

The slide shows how such a SQL query runs in Spark 2.4. With five reduce tasks for five partitions, Reduce 0 finishes very quickly because it has little data; next is Reduce 4; the slowest is Reduce 3. As a result, the time of the whole query is dominated by Reduce 3 while the other CPUs sit idle!

Spark 3.0 automatically selects an appropriate number of reduce tasks so that none of them sits idle.


Spark 2.4 selects the join strategy from static statistics. For example, if the static estimates for table1 and table2 are 100 GB and 80 GB respectively, the join shown uses a sort-merge join.


Spark 3.0, using the statistics available at runtime, knows that this join should be a broadcast join instead.

The time to join two tables depends on the largest partition, so if some partition is skewed during the join, you can imagine what happens to the processing time.
The adaptive execution in Spark 3.0 can split skewed partitions, bringing the running time back to something reasonable. For more on adaptive execution, see the earlier "Past memory big data" articles introducing Spark 3.0 adaptive query optimization and accelerating the execution performance of Spark SQL at runtime.

Dynamic partition pruning


The fourth optimization is dynamic partition pruning. It has already been covered in the articles "Understanding Apache Spark 3.0 Dynamic Partition Pruning" and "Understanding the use of Apache Spark 3.0 Dynamic Partition Pruning", so I won't introduce it again here. The following is an example of dynamic partition pruning.
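As a minimal sketch with illustrative table and column names (not from the talk): a large fact table partitioned by day is joined to a small, filtered dimension table, and the runtime filter on the dimension side prunes the fact table's partitions (the feature is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, true by default):

// fact: large table partitioned by day; dim: small dimension table
val result = spark.sql("""
  SELECT f.id, f.amount
  FROM fact f
  JOIN dim d ON f.day = d.day
  WHERE d.country = 'US'
""")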

Enhanced pruning and pushdown of nested columns

The fifth optimization concerns nested columns.

Spark 2.4 supports pruning of nested columns in Parquet, reading only the required part, but the support is very limited. For example, repartition does not support nested column pruning: the entire column is repartitioned before the required part is selected.

Spark 3.0 improves this so that nested column pruning works across all operators, including the repartition above, which greatly reduces I/O because far less data is read. The following is an example of nested column pruning.
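As a minimal sketch with illustrative names: only the struct field name.first needs to be read from the Parquet files, even though a repartition sits between the scan and the select (spark.sql.optimizer.nestedSchemaPruning.enabled is true by default in 3.0):

// Read a struct column, repartition, then keep only one nested field;
// Spark 3.0 prunes the scan down to name.first
val firstNames = spark.read.parquet("/path/to/people")
  .repartition(10)
  .select("name.first")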

Spark 2.4 does not support filter pushdown for nested columns (in Parquet or ORC): all the data has to be read before the filter is applied.


Spark 3.0 optimizes this part as well and now supports filter pushdown on nested columns for both Parquet and ORC.
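
As a minimal sketch with illustrative names, a filter on the nested field person.age can now be pushed down to the scan instead of being applied after all the data has been read:

import org.apache.spark.sql.functions.col

// The predicate on the nested field is pushed into the Parquet reader
val adults = spark.read.parquet("/path/to/people")
  .where(col("person.age") > 20)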

Enhanced code generation for aggregation

The sixth optimization enhances code generation for aggregation.

Complex aggregations in Spark 2.4 are very slow, because the code generated for complex queries is not compiled into native code. That is why, when you run TPC-DS, you will find Q66 to be very slow.

In Spark, Catalyst is responsible for translating a query into Java code, and the HotSpot compiler in OpenJDK is responsible for translating that Java code into native code.


But when a method exceeds 8000 bytes of Java bytecode, the HotSpot compiler gives up on compiling it to native code.
Catalyst in Spark 3.0 splits the aggregation logic into multiple smaller methods so that the HotSpot compiler can compile them to native code, which makes query performance much faster than in 2.4. The following is an example.
In the talk's example, with Spark 3.0 the largest generated method is under 8000 bytes of bytecode; with this feature turned off, the largest method exceeds 8000.
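
To look at the generated methods yourself, a minimal sketch using Spark's built-in debug helpers (assuming the temp table from earlier; debugCodegen prints the Java code for each whole-stage codegen subtree):

import org.apache.spark.sql.execution.debug._

// Print the Java code Catalyst generates for the aggregation query
val q = spark.sql(
  "SELECT key, max(val) FROM temp WHERE key > 0 GROUP BY key HAVING max(val) > 0")
q.debugCodegen()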

Support for new Scala and Java versions

The seventh improvement is support for newer language versions: Spark 3.0 adds support for Java 11 and is built with Scala 2.12.


Origin blog.51cto.com/15127589/2677779