Using explain to view Spark execution plans

1. Spark code processing flow

1.1 Detailed process of code processing

  1. Parse the SQL statement into an unresolved logical execution plan. "Unresolved" means that only the SQL syntax has been checked; the table names and column names have not yet been verified.
  2. Use the catalog to verify the table names and column names from step 1, producing a resolved (analyzed) logical execution plan. The catalog describes the attributes of each data set and where the data set is stored.
  3. Apply optimization rules to the analyzed plan to get the optimized logical execution plan.
  4. Transform the optimized logical execution plan into one or more candidate physical execution plans.
  5. Use the cost model (cost-based selection) to choose the physical execution plan to run, then generate executable code for it.
  6. Execute the generated code as RDD operations, which run as tasks.
     

1.2 Core process

  1. Analysis
  2. Logical optimization
  3. Physical execution plan generation
  4. Cost model evaluation
  5. Code generation
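
The stages listed above can also be inspected one by one from PySpark. The following is a minimal sketch rather than an official recipe. It assumes classic PySpark (not Spark Connect), where a DataFrame exposes its JVM counterpart through the internal attribute `_jdf`; the helper name `plan_stages` is made up for this example.

```python
def plan_stages(df):
    """Return the plan text at each stage of processing.

    Assumption: `df._jdf` is PySpark's internal handle to the JVM Dataset.
    It is not a public API and may change between Spark versions.
    """
    qe = df._jdf.queryExecution()
    return {
        "parsed": qe.logical().toString(),          # unresolved logical plan
        "analyzed": qe.analyzed().toString(),       # resolved against the catalog
        "optimized": qe.optimizedPlan().toString(), # after the optimizer rules
        "physical": qe.executedPlan().toString(),   # plan that will be executed
    }
```

For any DataFrame `df`, `plan_stages(df)["optimized"]` would give the optimized logical plan as a string.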

2. Viewing the execution plan in Spark

 2.1 Usage of explain

  • explain(): shows only the physical execution plan. (commonly used)
  • explain(mode="simple"): shows only the physical execution plan; this is the same as explain().
  • explain(mode="extended"): shows the logical execution plans (parsed, analyzed, and optimized) as well as the physical execution plan.
  • explain(mode="codegen"): shows the executable Java code generated by Codegen. (commonly used)
  • explain(mode="cost"): shows the optimized logical execution plan together with its statistics, where they are available.
  • explain(mode="formatted"): splits the output into a physical execution plan outline followed by the detailed information of each node, which is easier to read.
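
To compare the modes side by side, it can help to print them all for the same DataFrame. The sketch below does that; the helper name `explain_all` is made up for this example, and it assumes a Spark version that accepts the `mode` argument (Spark 3.0 or later).

```python
EXPLAIN_MODES = ["simple", "extended", "codegen", "cost", "formatted"]

def explain_all(df, modes=EXPLAIN_MODES):
    """Print the plan of `df` once per explain mode, with a banner in between."""
    for mode in modes:
        print(f"----- explain(mode={mode!r}) -----")
        df.explain(mode=mode)
```

Calling `explain_all(df)` on any DataFrame `df` prints five blocks, one per mode.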

Demonstration: we have a student table and a score table; we join them and then group the result.

sqlway=spark.sql("""
select student.s_id,count(1)
from student
left join score
on student.s_id=score.s_id
group by student.s_id
""")
sqlway.explain(mode="extended")  # show the logical execution plans and the physical execution plan
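
To reproduce the query above, the student and score tables must exist first. The original tables are not shown, so the following is only a sketch under assumptions: every column other than s_id, all row values, and the helper name `register_demo_tables` are invented for illustration.

```python
def register_demo_tables(spark):
    """Create tiny `student` and `score` temp views for the demo query.

    Assumption: apart from s_id, the column names and all rows are made up;
    the real tables in the demonstration may look different.
    """
    students = spark.createDataFrame(
        [(1, "Ann"), (2, "Bob"), (3, "Cid")], ["s_id", "s_name"]
    )
    scores = spark.createDataFrame(
        [(1, 90), (1, 75), (2, 60)], ["s_id", "s_score"]
    )
    students.createOrReplaceTempView("student")
    scores.createOrReplaceTempView("score")
```

Call it once with the active SparkSession (the `spark` object used in the demonstration) before running the query.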

The output of explain(mode="extended") shows the logical and physical execution plans.

 The parts of the output are explained as follows:

1. Unresolved logical execution plan: == Parsed Logical Plan ==
Meaning: The Parser component checks the SQL syntax and then generates an unresolved logical plan, without checking table names or column names.


2. Resolved logical execution plan: == Analyzed Logical Plan ==
Meaning: The Analyzer resolves and validates the semantics, column names, types, table names, etc. by accessing the Catalog in Spark.


3. Optimized logical execution plan: == Optimized Logical Plan ==
Meaning: The Catalyst optimizer rewrites the plan according to its optimization rules.


4. Physical execution plan: == Physical Plan ==
Meaning: The plan of physical operators that Spark will actually run; the Java code is generated from this plan and then executed.
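
Because each part starts with a `== ... ==` header, the extended output can be split into its four parts programmatically. The sketch below shows one way to do that on a plan that has already been captured as a string. The helper name `split_plan_sections` is made up, and the sample text is a placeholder, not real Spark output.

```python
import re

# A section header is a whole line of the form "== Title =="
SECTION_HEADER = re.compile(r"^== (.+?) ==[ \t]*$", re.MULTILINE)

def split_plan_sections(text):
    """Split explain(mode="extended") output into {section title: body}."""
    parts = SECTION_HEADER.split(text)
    # parts looks like [preamble, title1, body1, title2, body2, ...]
    return {title: body.strip() for title, body in zip(parts[1::2], parts[2::2])}

# Placeholder text standing in for real output
SAMPLE = """== Parsed Logical Plan ==
<parsed plan>
== Analyzed Logical Plan ==
<analyzed plan>
== Optimized Logical Plan ==
<optimized plan>
== Physical Plan ==
<physical plan>
"""

sections = split_plan_sections(SAMPLE)
print(list(sections))
```

In PySpark, explain prints its result rather than returning it, so the text would first have to be captured, for example with `contextlib.redirect_stdout`.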
 

3. Reading the Spark execution plan

This section reads through the execution plan produced by the code in section 2.

Note: The execution plan is read from bottom to top.

3.1 Read Parsed Logical Plan

 This part shows the unresolved logical execution plan. Reading from bottom to top, we see the table names first, then the join, and then the aggregate.

3.2 Read the Analyzed Logical Plan

 This part is the execution plan after the catalog has verified the table names and column names. It is very similar to the previous part, but information about the tables has been added. The number after the # sign is the unique ID that Spark assigns to each column, and a trailing L marks the column as a long integer.
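
The `name#id` notation can be picked apart with a small regular expression. The sketch below is only an illustration: the helper name `parse_attributes` and the sample line in the usage note are made up, and the pattern covers only simple names such as `s_id` or `count(1)`.

```python
import re

# name, then "#", then the numeric ID, then an optional "L" for long integer
ATTRIBUTE = re.compile(r"(?P<name>[\w.()]+)#(?P<expr_id>\d+)(?P<long>L?)\b")

def parse_attributes(plan_line):
    """Return (column name, ID, is_long) for each column reference in a plan line."""
    return [
        (m["name"], int(m["expr_id"]), m["long"] == "L")
        for m in ATTRIBUTE.finditer(plan_line)
    ]
```

For a hypothetical line such as `Project [s_id#3L, s_name#4]`, this returns `[("s_id", 3, True), ("s_name", 4, False)]`.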

3.3 Read the Optimized Logical Plan

 This part is the optimized logical execution plan. It adds steps such as null checks and automatically inserted filters, and streamlines the logical execution process.

3.4 Read the Physical Plan

This section introduces some of the operator names that appear in the physical execution plan:

  • HashAggregate: data aggregation. HashAggregate usually appears in pairs: the first one partially aggregates the local data on the executing node, and the second one merges the partial results of all partitions into the final result.
  • Exchange: a shuffle, meaning that data needs to be moved across the cluster. The two HashAggregate operators are often separated by an Exchange.
  • Project: the projection operation in SQL, that is, selecting columns (for example: select name, age...).
  • BroadcastHashJoin: a hash join in which one side of the join is broadcast.
  • LocalTableScan: a full scan of a local table, that is, data already held locally rather than read from external storage.

From these operators, we can see that the physical execution plan shows where the table's files are located, which columns are fetched, the pre-aggregation (partial reduce), the broadcast, the join method, the aggregated columns, and other information.
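
A quick way to get an overview of a long physical plan is to count its operators, for example to see how many Exchange (shuffle) steps it contains. The sketch below assumes the plan text is already available as a string. The helper name `count_operators` is made up, and the sample plan is hand-written and abbreviated for illustration, not real Spark output.

```python
import re
from collections import Counter

# Leading tree-drawing characters (":", "+-", spaces) and an optional
# whole-stage-codegen marker such as "*(2) "
PREFIX = re.compile(r"^[\s:+\-]*(?:\*\(\d+\)\s*)?")
OPERATOR = re.compile(r"[A-Za-z]\w*")

def count_operators(physical_plan):
    """Count how often each operator name starts a line of the plan text."""
    counts = Counter()
    for line in physical_plan.splitlines():
        rest = PREFIX.sub("", line, count=1)
        match = OPERATOR.match(rest)
        if match:
            counts[match.group()] += 1
    return counts

# Hand-written, abbreviated plan used only to exercise the helper
SAMPLE_PLAN = """*(3) HashAggregate(keys=[s_id], functions=[count(1)])
+- Exchange hashpartitioning(s_id, 200)
   +- *(2) HashAggregate(keys=[s_id], functions=[partial_count(1)])
      +- *(2) Project [s_id]
         +- *(2) BroadcastHashJoin [s_id], [s_id], LeftOuter, BuildRight
            :- LocalTableScan [s_id]
            +- BroadcastExchange HashedRelationBroadcastMode
               +- LocalTableScan [s_id]
"""

operator_counts = count_operators(SAMPLE_PLAN)
print(operator_counts["Exchange"], operator_counts["HashAggregate"])
```

In this sample the pair of HashAggregate operators is separated by a single Exchange, which matches the pattern described above.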

 

 

Origin blog.csdn.net/u011487470/article/details/128566602