The Journey of a SQL Query in Apache Spark (Part 2)

In  "A SQL in Apache Spark trip (on)"  article, we introduce a SQL in the Apache  the Spark  trip Parser and Analyzer two processes, we continue to introduce connected text.


Logical plan optimization stage - Optimizer

The binding (analysis) stage described earlier transforms the Unresolved LogicalPlan into an Analyzed Logical Plan. This Analyzed Logical Plan could in principle be converted directly into a Physical Plan and executed by Spark. However, the Physical Plan obtained that way is usually not optimal: in practice many queries are written in inefficient ways, so the Analyzed Logical Plan needs further optimization to obtain a better logical operator tree. This is the job of the SQL optimizer, the Optimizer.
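As a quick aside, each of these stages can be inspected directly from a DataFrame's QueryExecution in a spark-shell session (where spark is the SparkSession). The snippet below is only a sketch: it assumes the article's t1 table has already been registered, e.g. as a temporary view, and the query itself is just a placeholder.

val df = spark.sql("SELECT id, 1 + 2 + value AS v FROM t1 WHERE id > 5")

val qe = df.queryExecution
println(qe.analyzed)       // Analyzed Logical Plan produced by the Analyzer
println(qe.optimizedPlan)  // Optimized Logical Plan produced by the Optimizer
println(qe.sparkPlan)      // physical plan chosen by the SparkPlanner
println(qe.executedPlan)   // physical plan after the preparation rules

Calling df.explain(true) prints all of these stages in one go.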

The Optimizer at this stage is mainly rule-based (Rule-Based Optimizer, RBO), and most of its rules are heuristics, i.e. rules derived from intuition or experience. Examples include column pruning (dropping columns the query does not need), predicate pushdown (pushing filter conditions as close to the data source as possible), constant folding (e.g. computing 1 + 2 in advance) and constant propagation (e.g. SELECT * FROM table WHERE i = 5 AND j = i + 3 can be rewritten as SELECT * FROM table WHERE i = 5 AND j = 8).

Just like the binding (analysis) stage introduced earlier, every rule at this stage implements the Rule abstract class, multiple rules are grouped into a Batch, the Batches form a sequence of batches, and they are all executed by the RuleExecutor. Since the execution of rules was already covered before, it will not be repeated here.
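To make the Rule / Batch / RuleExecutor picture more concrete, here is a minimal sketch of a hand-written optimizer rule. It is purely illustrative and not one of Spark's built-in rules; the two-argument Add pattern matches the Spark 2.x class used in this article.

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Illustrative rule: rewrite "expr + 0" into "expr" for integer literals.
object RemoveAddZero extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(expr, Literal(0, IntegerType)) => expr
  }
}

Such a rule can be tried out without rebuilding Spark by appending it to spark.experimental.extraOptimizations, which injects extra rules into the Optimizer.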

So which of these optimizations are applied to the SQL statement from the previous article? Let's walk through the rules in their execution order.

Predicate pushdown

Predicate pushdown in Spark SQL is implemented by the PushDownPredicate rule. Its main job is to push filter conditions as far down as possible, ideally all the way to the data source. For our SQL statement, the logical plan after applying predicate pushdown looks like this:

[Figure: optimized logical plan after predicate pushdown]

As can be seen from the figure (note that the plan is read bottom-up), predicate pushdown moves the Filter to sit directly below the Join operator. That is, when scanning table t1 we immediately apply the condition ((((isnotnull(cid#2) && isnotnull(did#3)) && (cid#2 = 1)) && (did#3 = 2)) && (id#0 > 5)) && isnotnull(id#0) to keep only the rows that satisfy it, and when scanning table t2 we apply the condition isnotnull(id#8) && (id#8 > 5). This greatly reduces the amount of data the Join operator has to process and speeds up the computation.
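The effect is easy to reproduce on a toy example (the DataFrames and column names below are made up for illustration): write the filter after the join and check where it ends up in the optimized logical plan.

import spark.implicits._

val left  = spark.range(100).selectExpr("id", "id % 3 AS cid")
val right = spark.range(100).toDF("id")

// The filter on cid is written after the join ...
val joined = left.join(right, "id").filter($"cid" === 1)

// ... but in the "Optimized Logical Plan" printed by explain(true) the Filter
// should appear below the Join, on the left-hand branch.
joined.explain(true)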

Column pruning

Column pruning in Spark SQL is implemented by the ColumnPruning rule. A table may have many columns, but a given query very often does not need all of them. Column pruning removes the columns the query does not use, so that less data is scanned. For our SQL statement, the logical plan after applying column pruning looks like this:

[Figure: logical plan after column pruning]

As can be seen from the figure, after column pruning only the id and value columns of table t1 are needed, and only the id column of table t2. This reduces the amount of data transferred, and if the underlying file format is columnar (such as Parquet), it can also greatly speed up the data scan.
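A quick way to watch column pruning at work is to select only a couple of columns from a wider table and look at the ReadSchema of the resulting scan. The sketch below reuses the t1.csv path from the article's example and assumes the file has the columns id, value, cid and did.

import spark.implicits._

val t1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/iteblog/t1.csv")

val pruned = t1.filter($"id" > 5).select($"id", $"value")
pruned.explain()
// The FileScan's ReadSchema should list only id and value. With a columnar
// format such as Parquet the other columns would not even be read from disk;
// CSV still parses whole lines, but only the pruned columns flow up the plan.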

Constant propagation

Constant propagation in Spark SQL is implemented by the ConstantPropagation rule. It substitutes known constants for variables in expressions; for example, SELECT * FROM table WHERE i = 5 AND j = i + 3 can be rewritten as SELECT * FROM table WHERE i = 5 AND j = 8. This may look like a small change, but when a very large number of rows are scanned it can save a lot of computation time. After this optimization, the resulting logical plan is as follows:

[Figure: logical plan after constant propagation]

Our query contains the conditions t1.cid = 1 AND t1.did = t1.cid + 1. As can be seen from the plan, the value of t1.cid is already fixed, so it can be used to compute t1.did at planning time.
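The rewrite can be checked on a small hand-built table (the table name, schema and data below are made up for illustration; an RDD-backed DataFrame is used so that the Filter is not folded away into a local relation).

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("i", IntegerType), StructField("j", IntegerType)))
val rows = spark.sparkContext.parallelize(Seq(Row(5, 8), Row(6, 9)))
spark.createDataFrame(rows, schema).createOrReplaceTempView("demo")

val plan = spark.sql("SELECT * FROM demo WHERE i = 5 AND j = i + 3")
  .queryExecution.optimizedPlan
println(plan)
// The Filter should now contain (j = 8): ConstantPropagation substituted
// i = 5 into i + 3, and ConstantFolding then reduced 5 + 3 to 8.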

Constant folding

Constant folding in Spark SQL is implemented by the ConstantFolding rule. Similar to constant propagation, it evaluates constant expressions in advance at this stage. The change may seem insignificant, but over very large amounts of data it saves a lot of computation time and reduces resource usage such as CPU. After this optimization, the resulting logical plan is as follows:

[Figure: logical plan after constant folding]
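The same kind of check works for constant folding; the tiny query below uses only spark.range, so it is fully self-contained.

val folded = spark.range(5)
  .selectExpr("1 + 2 + id AS v")
  .queryExecution.optimizedPlan
println(folded)
// The Project should contain (3 + id): the constant part of the expression
// was evaluated once at planning time instead of once per row.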

So, after the four optimization steps above, the optimized logical plan is as follows:

== Optimized Logical Plan ==
Aggregate [sum(cast(v#16 as bigint)) AS sum(v)#22L]
+- Project [(3 + value#1) AS v#16]
   +- Join Inner, (id#0 = id#8)
      :- Project [id#0, value#1]
      :  +- Filter (((((isnotnull(cid#2) && isnotnull(did#3)) && (cid#2 = 1)) && (did#3 = 2)) && (id#0 > 5)) && isnotnull(id#0))
      :     +- Relation[id#0,value#1,cid#2,did#3] csv
      +- Project [id#8]
         +- Filter (isnotnull(id#8) && (id#8 > 5))
            +- Relation[id#8,value#9,cid#10,did#11] csv

The corresponding diagram is as follows:

[Figure: the fully optimized logical plan]

At this point, the logical plan optimization stage is complete. In total, Spark provides as many as 70 built-in optimization rules; the full list can be found in the Optimizer class of the Spark source code.

Generating the executable physical plan - SparkPlanner

The logical plans described so far cannot actually be run by Spark. To execute the SQL, the logical plan must be translated into a physical plan; only after this stage does Spark know how to actually execute the query. Unlike the binding and optimization stages described earlier, this stage uses strategies (Strategy). Moreover, the transformations in the binding and optimization stages never change the type of the tree: an Expression is transformed into another Expression, and a Logical Plan into another Logical Plan. In this stage, however, the transformation changes the tree type: a Logical Plan is converted into a Physical Plan.

A logical plan (Logical Plan) is turned into multiple physical plans (Physical Plans) by a series of strategies; physical plans in Spark are implemented by the SparkPlan class. A cost model (Cost Model) then selects the best one (Selected Physical Plan) from these candidates. The whole process looks like this:

[Figure: from logical plan through strategies and the cost model to the selected physical plan]

The Cost Model corresponds to cost-based optimization (Cost-Based Optimization, CBO), implemented mainly by engineers from Huawei (see SPARK-16026). Its core idea is to compute the cost of each physical plan and then pick the cheapest one. However, in the latest release at the time of writing, Spark 2.4.3, this step is not implemented: Spark simply takes the first physical plan in the list returned by the planner as the best one, as shown below:

lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    // TODO: We use next(), i.e. take the first plan returned by the planner, here for now,
    //       but we will implement to choose the best plan.
    planner.plan(ReturnAnswer(optimizedPlan)).next()
}

The CBO optimization introduced by SPARK-16026 mainly applies to the logical plan optimization stage (Optimizer) described earlier and corresponds to the CostBasedJoinReorder rule. It is disabled by default and has to be enabled through the spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled parameters.
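For reference, here is a hedged sketch of turning CBO-based join reordering on for a session. Both flags are needed, and the rule only has something to work with if table and column statistics have been collected beforehand; the table names follow the article's example and are assumed to be registered in the catalog (not temporary views).

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect the statistics that CostBasedJoinReorder relies on:
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, cid, did")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR COLUMNS id")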
Back to our example: at this point, the resulting physical plan is as follows:

== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[sum(cast(v#16 as bigint))], output=[sum(v)#22L])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(v#16 as bigint))], output=[sum#24L])
      +- *(2) Project [(3 + value#1) AS v#16]
         +- *(2) BroadcastHashJoin [id#0], [id#8], Inner, BuildRight
            :- *(2) Project [id#0, value#1]
            :  +- *(2) Filter (((((isnotnull(cid#2) && isnotnull(did#3)) && (cid#2 = 1)) && (did#3 = 2)) && (id#0 > 5)) && isnotnull(id#0))
            :     +- *(2) FileScan csv [id#0,value#1,cid#2,did#3] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/iteblog/t1.csv], PartitionFilters: [], PushedFilters: [IsNotNull(cid), IsNotNull(did), EqualTo(cid,1), EqualTo(did,2), GreaterThan(id,5), IsNotNull(id)], ReadSchema: struct<id:int,value:int,cid:int,did:int>
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
               +- *(1) Project [id#8]
                  +- *(1) Filter (isnotnull(id#8) && (id#8 > 5))
                     +- *(1) FileScan csv [id#8] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/iteblog/t2.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,5)], ReadSchema: struct<id:int>

As can be seen from the output above, at the physical planning stage the data source is already known to be CSV files, along with the file paths and the schema of the data. Moreover, the filter conditions are pushed directly into the file scan (PushedFilters). At the same time, the Join has become a BroadcastHashJoin, i.e. the data of table t2 is broadcast to the nodes holding the data of table t1. The corresponding diagram is as follows:

[Figure: physical plan with BroadcastHashJoin]
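Whether a join is planned as a BroadcastHashJoin is driven by spark.sql.autoBroadcastJoinThreshold (10 MB by default); a broadcast can also be requested explicitly with a hint. A small sketch, again reusing the article's table names and assuming they are registered:

import org.apache.spark.sql.functions.broadcast

val t1 = spark.table("t1")
val t2 = spark.table("t2")

// Explicitly ask for t2 to be broadcast, regardless of its estimated size.
val joined = t1.join(broadcast(t2), "id")
joined.explain()   // the physical plan should contain BroadcastHashJoin

// Setting the threshold to -1 disables automatic broadcast joins entirely:
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")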

At this point, the Physical Plan has been fully generated. For reasons of space, the rest of this SQL query's journey, including whole-stage code generation (WholeStageCodeGen) and the actual execution, will be covered in the next article, so stay tuned.

Reprinted from Past Memories (https://www.iteblog.com/)
Original link: [The Journey of a SQL Query in Apache Spark (Part 2)](https://www.iteblog.com/archives/2562.html)


Origin blog.csdn.net/BD_fuhong/article/details/94756337