A SQL Query's Journey Through Apache Spark (Part 1)

Spark SQL is one of the most technologically sophisticated components in Spark, supporting both SQL queries and the DataFrame DSL. The SQL support greatly lowers developers' learning and development costs. Currently SQL, Spark ML, Spark Graph, and Structured Streaming all run on top of Catalyst Optimization & Tungsten Execution, as shown below:

[Figure: SQL, Spark ML, Spark Graph, and Structured Streaming all running on Catalyst Optimization & Tungsten Execution]

A normal SQL statement therefore first goes through the SQL Parser to be parsed, then through the Catalyst optimizer, and is finally executed by Spark. The Catalyst pipeline consists of several phases, including:

  • Analysis: resolves the Unresolved Logical Plan into an Analyzed Logical Plan, mainly by using information in the Catalog;
  • Logical Optimization: applies a set of Rules to turn the Analyzed Logical Plan into an Optimized Logical Plan;
  • Physical Planning: the logical plans produced so far cannot be executed by Spark directly; this phase converts the logical plan into one or more physical plans and uses a cost model to pick the best one;
  • Code Generation: this phase compiles the SQL query into Java bytecode.

The entire SQL execution process can therefore be illustrated with the following figure:

[Figure: the overall SQL execution flow in Spark]

The blue part of the figure is what the Catalyst optimizer handles, and it is also the focus of this article. Below we take a simple SQL query as an example and look, from a high-level perspective, at its journey through Spark. Throughout this article we use the following query:

SELECT sum(v)
  FROM (
    SELECT
      t1.id,
      1 + 2 + t1.value AS v
    FROM t1 JOIN t2
    WHERE
      t1.id = t2.id AND
      t1.cid = 1 AND
      t1.did = t1.cid + 1 AND
      t2.id > 5) iteblog
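
If you want to follow along, here is a minimal sketch (assuming two CSV files with the columns id, value, cid, and did that appear in the analyzed plan later on; the paths are placeholders) that registers t1 and t2 as temporary views and prints every plan with explain(true):

// Minimal sketch: CSV paths are placeholders, column names follow the plans shown below
val t1 = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/t1.csv")
val t2 = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/t2.csv")
t1.createOrReplaceTempView("t1")
t2.createOrReplaceTempView("t2")

val query =
  """SELECT sum(v)
    |  FROM (
    |    SELECT t1.id, 1 + 2 + t1.value AS v
    |    FROM t1 JOIN t2
    |    WHERE t1.id = t2.id AND t1.cid = 1 AND t1.did = t1.cid + 1 AND t2.id > 5) iteblog
    |""".stripMargin

// explain(true) prints the Parsed, Analyzed and Optimized Logical Plans plus the Physical Plan
spark.sql(query).explain(true)

The Parsed and Analyzed Logical Plans printed this way are exactly the ones discussed in the rest of this article.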

SQL parsing stage - SparkSqlParser

For Spark to execute a SQL query, the first step is of course to parse it. In Spark 1.x, SQL could be parsed in two ways:

  • based on Scala parser combinators
  • based on Hive's SQL parser

Which parser to use could be chosen with the spark.sql.dialect setting. Although the parsing engine was selectable, both approaches had problems: the Scala parser combinator sometimes produced misleading error messages and did not warn when the grammar definition contained conflicts, while the Hive SQL parser depends on Hive, which limits extensibility.

To solve these problems, Spark 2.0.0 introduced the third-party parser generator ANTLR (see SPARK-12362 for details). ANTLR is a powerful tool for reading, processing, executing, and translating structured text or binary files; it is currently the most widely used parser generator in the Java world, and the SQL parsers of many common big-data systems, including Hive, Cassandra, Phoenix, Pig, and Presto, are built with it. Recent versions of Spark use ANTLR4 to perform lexical analysis of the SQL and build its syntax tree.

 

Specifically, Spark defines its SQL grammar in the file SqlBase.g4 (located at spark-2.4.3\sql\catalyst\src\main\antlr4\org\apache\spark\sql\catalyst\parser\SqlBase.g4), which is based on Presto's grammar file and specifies the SQL syntax Spark SQL supports. If we want to add new syntax, we first define it in this file. ANTLR4 then generates several Java classes from SqlBase.g4, including the important lexer SqlBaseLexer.java and parser SqlBaseParser.java. When the SQL above is run, SqlBaseLexer tokenizes the keywords and identifiers, and SqlBaseParser then builds the syntax tree. The whole process is roughly as shown in the figure below.

[Figure: SqlBaseLexer tokenizes the SQL and SqlBaseParser builds the syntax tree]

Once the syntax tree is generated, AstBuilder converts it into a LogicalPlan, also known as the Unresolved LogicalPlan. The parsed logical plan looks like this:

== Parsed Logical Plan ==
'Project [unresolvedalias('sum('v), None)]
+- 'SubqueryAlias `iteblog`
   +- 'Project ['t1.id, ((1 + 2) + 't1.value) AS v#16]
      +- 'Filter ((('t1.id = 't2.id) && ('t1.cid = 1)) && (('t1.did = ('t1.cid + 1)) && ('t2.id > 5)))
         +- 'Join Inner
            :- 'UnresolvedRelation `t1`
            +- 'UnresolvedRelation `t2`

Represented as a diagram:

[Figure: the Unresolved LogicalPlan operator tree]

The Unresolved LogicalPlan is read from the bottom up: the tables t1 and t2 have become UnresolvedRelations, and the filter conditions, selected columns, and aggregation field are all in place. The first stage of the SQL's journey is now complete.
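
Incidentally, this first stage can be exercised on its own: Catalyst's parser can turn the SQL text directly into an Unresolved LogicalPlan without touching any data. A small sketch follows (CatalystSqlParser lives in Catalyst's internal packages, so it is not a stable public API; query is the SQL string from the earlier sketch):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// parsePlan only builds the Unresolved LogicalPlan; no catalog lookup happens here
val unresolved = CatalystSqlParser.parsePlan(query)
// treeString renders the operator tree, matching the Parsed Logical Plan shown above
println(unresolved.treeString)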

Logical plan binding stage - Analyzer

The SQL parsing stage produced the Unresolved LogicalPlan. As the figure above shows, the logical operator tree contains objects such as UnresolvedRelation and unresolvedalias. The Unresolved LogicalPlan is merely a data structure and carries no data information: it does not know the data sources, the data types, or which table each column comes from. The Analyzer stage transforms the Unresolved LogicalPlan using a set of predefined Rules together with the SessionCatalog. The SessionCatalog provides unified management of function resources and metadata (databases, tables, views, partitions, functions, and so on). The Rules are defined inside the Analyzer, as follows:

lazy val batches: Seq[Batch] = Seq(
    Batch("Hints", fixedPoint,
      new ResolveHints.ResolveBroadcastHints(conf),
      ResolveHints.ResolveCoalesceHints,
      ResolveHints.RemoveAllHints),
    Batch("Simple Sanity Check", Once,
      LookupFunctions),
    Batch("Substitution", fixedPoint,
      CTESubstitution,
      WindowsSubstitution,
      EliminateUnions,
      new SubstituteUnresolvedOrdinals(conf)),
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions ::                    // resolve table-valued functions
      ResolveRelations ::                               // resolve tables or views
      ResolveReferences ::                              // resolve columns
      ResolveCreateNamedStruct ::
      ResolveDeserializer ::                            // resolve deserializer classes
      ResolveNewInstance ::
      ResolveUpCast ::                                  // resolve type casts
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveOrdinalInOrderByAndGroupBy ::
      ResolveAggAliasInGroupBy ::
      ResolveMissingReferences ::
      ExtractGenerator ::
      ResolveGenerate ::
      ResolveFunctions ::                               // resolve functions
      ResolveAliases ::                                 // resolve aliases
      ResolveSubquery ::                                // resolve subqueries
      ResolveSubqueryColumnAliases ::
      ResolveWindowOrder ::
      ResolveWindowFrame ::
      ResolveNaturalAndUsingJoin ::
      ResolveOutputRelation ::
      ExtractWindowExpressions ::
      GlobalAggregates ::
      ResolveAggregateFunctions ::
      TimeWindowing ::
      ResolveInlineTables(conf) ::
      ResolveHigherOrderFunctions(catalog) ::
      ResolveLambdaVariables(conf) ::
      ResolveTimeZone(conf) ::
      ResolveRandomSeed ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules : _*),
    Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
    Batch("View", Once,
      AliasViewChild(conf)),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF),
    Batch("FixNullability", Once,
      FixNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases)
)

As the code above shows, several Rules of a similar nature are grouped into a Batch; for example, the Batch named Hints consists of several hint-related Rules, and multiple Batches in turn make up batches. These batches are executed by a RuleExecutor: the Batches run one after another in order, and within each Batch the Rules are applied in order. Each Batch is executed either once (Once) or repeatedly until a fixed point is reached (FixedPoint, bounded by the spark.sql.optimizer.maxIterations parameter). The execution process is shown below:

[Figure: the RuleExecutor applying Batches of Rules to the logical plan]
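
To make the notion of a Rule more concrete, here is a minimal, purely illustrative Rule (it is not one of Spark's built-in rules): a Rule is simply a function from one LogicalPlan to another, usually implemented by pattern matching on the plan or its expression trees, which is also how rules such as ResolveRelations and ResolveReferences are written.

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Hypothetical example rule: fold additions of two integer literals, e.g. 1 + 2 becomes 3
object FoldIntAdd extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
      Literal(a + b)
  }
}

// Applying it returns a new, transformed plan; the original plan is left untouched
// val folded = FoldIntAdd(parsedPlan)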

After this stage, the SQL above yields the following Analyzed Logical Plan:

== Analyzed Logical Plan ==
sum(v): bigint
Aggregate [sum(cast(v#16 as bigint)) AS sum(v)#22L]
+- SubqueryAlias `iteblog`
   +- Project [id#0, ((1 + 2) + value#1) AS v#16]
      +- Filter (((id#0 = id#8) && (cid#2 = 1)) && ((did#3 = (cid#2 + 1)) && (id#8 > 5)))
         +- Join Inner
            :- SubqueryAlias `t1`
            :  +- Relation[id#0,value#1,cid#2,did#3] csv
            +- SubqueryAlias `t2`
               +- Relation[id#8,value#9,cid#10,did#11] csv

As can be seen from the result above, t1 and t2 have each been resolved into a table with four columns (id, value, cid, and did) whose data source is a CSV file. The position and type of every column have been determined, and sum has been resolved as an Aggregate function. The following figure compares the Unresolved LogicalPlan with the Analyzed Logical Plan.

[Figure: Unresolved LogicalPlan compared with the Analyzed LogicalPlan]
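
Besides explain, both of these plans can also be obtained programmatically from a Dataset's QueryExecution, which is handy for comparing them side by side. A brief sketch (query is the SQL string from the earlier sketch):

val df = spark.sql(query)
println(df.queryExecution.logical)    // the parsed (Unresolved) logical plan
println(df.queryExecution.analyzed)   // the Analyzed logical plan produced by the Analyzer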

At this point the Analyzed LogicalPlan has been fully generated. For reasons of space, the remaining stages of the SQL's journey, including logical plan optimization and code generation, will be covered in the next article, so stay tuned.
 

Reprinted from Past Memories (https://www.iteblog.com/)
Original link: [A SQL Query's Journey Through Apache Spark (Part 1)](https://www.iteblog.com/archives/2561.html)
