We have finally arrived at the last part. In the previous two articles, "A SQL's Journey in Apache Spark (Part 1)" and "A SQL's Journey in Apache Spark (Part 2)", we walked through SQL parsing, logical plan binding and optimization, and physical plan generation in Spark SQL. This article continues from there and covers Spark's whole-stage code generation and the final execution.
Whole-stage code generation (WholeStageCodegen)
We have already described how the physical plan (Physical Plan) is generated from the logical plan, but Spark does not execute this physical plan directly. Spark first applies one more batch of rules to the SparkPlan; this is the prepareForExecution step. The rules are as follows:
(Code: the list of preparation rules applied in prepareForExecution)
Among the rules above, CollapseCodegenStages is the protagonist: it is the well-known whole-stage code generation, and this rule is the entry point of Catalyst's whole-stage code generation. Of course, if you want Spark to perform whole-stage code generation, you need to set spark.sql.codegen.wholeStage to true (which is the default).
Why code generation is needed
Before introducing code generation, let us first understand why Spark SQL needed to introduce it. Prior to Apache Spark 2.0, the underlying implementation of Spark SQL was based on the Volcano Iterator Model (see "Volcano-An Extensible and Parallel Query Evaluation System"), which Goetz Graefe proposed in 1993; the SQL processing layer of the vast majority of today's database systems is still based on this model. Its implementation can be summarized as follows: the database engine first translates the SQL into a series of relational-algebra operators or expressions, and then relies on these operators to process the input data one by one and produce the result. Each operator implements the same interface at the bottom, for example a next() method; the top-level operator's next() calls its child operator's next(), that child's next() calls its own child's next(), and so on, all the way down to the leaf operator's next(). The following figure shows the process:
(Figure: the next() call chain in the Volcano Iterator Model)
The advantage of the Volcano Iterator Model is that the abstraction is very simple and easy to implement, and arbitrarily complex queries can be expressed by combining operators. But its drawbacks are also obvious: the large number of virtual function calls interrupts the CPU's work and ultimately hurts execution efficiency. Databricks' official blog compared the efficiency of the Volcano Iterator Model against handwritten code and found that the handwritten code ran about ten times faster!
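To make the contrast concrete, here is a toy sketch (not Spark's actual operators; all names are illustrative) of a Scan -> Filter pipeline written Volcano-style with chained iterators, next to the same pipeline collapsed into one handwritten loop. The iterator version pays virtual dispatch through the Iterator interface on every row; the fused version does not.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Toy comparison: Volcano-style chained iterators vs a hand-fused loop.
public class VolcanoVsFused {

    // Scan -> Filter chained via Iterator<Long>: every row pays
    // virtual calls through the Iterator interface.
    static long volcanoSum(long[] rows) {
        Iterator<Long> scan = new Iterator<Long>() {
            int i = 0;
            public boolean hasNext() { return i < rows.length; }
            public Long next() { return rows[i++]; }
        };
        Iterator<Long> filter = new Iterator<Long>() {
            Long buffered = null;
            public boolean hasNext() {
                while (buffered == null && scan.hasNext()) {
                    Long v = scan.next();
                    if (v > 5) buffered = v;   // predicate: id > 5
                }
                return buffered != null;
            }
            public Long next() {
                if (!hasNext()) throw new NoSuchElementException();
                Long v = buffered;
                buffered = null;
                return v;
            }
        };
        long sum = 0;
        while (filter.hasNext()) sum += filter.next();
        return sum;
    }

    // "Handwritten" style: the same pipeline as one tight loop over
    // primitives, with no iterator interfaces and no per-row dispatch.
    static long fusedSum(long[] rows) {
        long sum = 0;
        for (long v : rows) {
            if (v > 5) sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] rows = {1, 6, 7, 3, 10};
        System.out.println(volcanoSum(rows)); // prints 23
        System.out.println(fusedSum(rows));   // prints 23
    }
}
```

Both compute the same result; the point is that the second form gives the JIT compiler a simple monomorphic loop over primitives, which is essentially what whole-stage code generation aims to produce automatically.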
Based on these findings, the community introduced Whole-stage Code Generation starting from Apache Spark 2.0 (see SPARK-12795); the main idea is to emit code that mimics handwritten code, so as to improve the efficiency of Spark SQL. Whole-stage Code Generation comes from the paper "Efficiently Compiling Efficient Query Plans for Modern Hardware" published by Thomas Neumann in 2011, and it is also part of the Tungsten project.
Tungsten's code generation is divided into three parts:
- Expression code generation (expression codegen)
- Whole-stage code generation (Whole-stage Code Generation)
- Speeding up serialization and deserialization (speed up serialization/deserialization)
Expression code generation (expression codegen)
In fact, this already existed in Spark 1.x. The base class for expression code generation is org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator, which has seven subclasses:
(Figure: the seven subclasses of CodeGenerator)
The isnotnull(id#8) && (id#8 > 5) generated in the logical plan of our earlier SQL example is the most basic kind of expression. It is also a Predicate, so org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate is called to generate the expression's code, and the generated code is as follows:
(Code: the generated SpecificPredicate class)
The above is the code generated for the expression isnotnull(id#8) && (id#8 > 5). It uses the code generation of three Predicates, org.apache.spark.sql.catalyst.expressions.And, org.apache.spark.sql.catalyst.expressions.IsNotNull and org.apache.spark.sql.catalyst.expressions.GreaterThan, which together make up the SpecificPredicate above. SpecificPredicate applies its eval function to each row to decide whether the row satisfies the condition. The logic of the generated SpecificPredicate class is not complicated; you can study it at your leisure.
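As a companion to the listing above, here is a hand-written analogue (not Spark's exact output; the Row stand-in and names are illustrative) of what GeneratePredicate boils the expression tree down to: the And, IsNotNull and GreaterThan nodes collapse into straight-line null-check-then-compare code.

```java
// Hand-written analogue of the class GeneratePredicate emits for
// isnotnull(id#8) && (id#8 > 5). Names are illustrative, not Spark's output.
public class SpecificPredicateSketch {

    // Stands in for Spark's InternalRow: a single nullable long column "id".
    static final class Row {
        final Long id;
        Row(Long id) { this.id = id; }
        boolean isNullAt(int ordinal) { return id == null; }
        long getLong(int ordinal) { return id; }
    }

    // eval() mirrors the fused evaluation the codegen produces:
    // no Expression objects, no tree walking, just flat Java.
    static boolean eval(Row row) {
        boolean isNull = row.isNullAt(0);
        if (isNull) {
            return false;        // isnotnull(id#8) failed, And short-circuits
        }
        long value = row.getLong(0);
        return value > 5;        // (id#8 > 5)
    }

    public static void main(String[] args) {
        System.out.println(eval(new Row(8L)));   // prints true
        System.out.println(eval(new Row(3L)));   // prints false
        System.out.println(eval(new Row(null))); // prints false
    }
}
```

Interpreted evaluation would instead walk an And(IsNotNull(id), GreaterThan(id, 5)) tree per row, calling eval virtually on each node; the generated class trades that tree walk for the flat code above.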
Expression code generation mainly aims to eliminate the cost of large numbers of virtual function calls and of generic (interpreted) evaluation. Note that a complete standalone class like the one above is generated from the expression only when spark.sql.codegen.wholeStage is set to false; otherwise only a code fragment is generated, which is combined with other fragments into the whole-stage code.
Whole-stage code generation (Whole-stage Code Generation)
Whole-stage code generation fuses multiple processing steps into a single code module, and the expression code generation described above is used inside it. Unlike expression code generation, which targets individual expressions only, whole-stage code generation targets the whole SQL pipeline. The generated class always extends org.apache.spark.sql.execution.BufferedRowIterator and must implement the processNext() method, which is invoked from the doExecute method of org.apache.spark.sql.execution.WholeStageCodegenExec. The RDD in that method feeds the data into the generated code.

For example, the data source of the SQL in our running example is a CSV file. At the bottom, org.apache.spark.sql.execution.FileSourceScanExec reads the file and produces an inputRDD; the doExecute method of WholeStageCodegenExec invokes the generated code over this RDD, which then runs our various predicates to produce the final result. Part of the doExecute method of WholeStageCodegenExec looks like this:
(Code: excerpt of the doExecute method of WholeStageCodegenExec)
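The interplay between BufferedRowIterator and the generated processNext() can be sketched as follows. This is a stripped-down simulation (rows are plain longs, and the class names mimic but are not Spark's real classes): hasNext() triggers processNext(), the "generated" subclass pulls from the input iterator, applies the fused predicate, and appends surviving rows to a buffer that next() drains.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Stripped-down analogue of Spark's BufferedRowIterator protocol.
public class WholeStageSketch {

    static abstract class BufferedRowIterator {
        protected final LinkedList<Long> currentRows = new LinkedList<>();
        protected Iterator<Long> input;

        void init(Iterator<Long> input) { this.input = input; }

        boolean hasNext() {
            // Ask the generated code to produce more rows when the buffer is empty.
            if (currentRows.isEmpty()) processNext();
            return !currentRows.isEmpty();
        }

        Long next() { return currentRows.removeFirst(); }

        protected void append(Long row) { currentRows.add(row); }

        // In Spark this body is the generated whole-stage code.
        protected abstract void processNext();
    }

    // What the "generated" subclass boils down to for our running filter.
    static final class GeneratedIterator extends BufferedRowIterator {
        protected void processNext() {
            while (input.hasNext()) {
                long id = input.next();
                if (id > 5) {        // fused isnotnull(id) && id > 5
                    append(id);
                    return;          // yield one row, resume on the next call
                }
            }
        }
    }

    static List<Long> run(List<Long> source) {
        GeneratedIterator it = new GeneratedIterator();
        it.init(source.iterator());
        List<Long> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1L, 6L, 7L, 3L, 10L))); // prints [6, 7, 10]
    }
}
```

The important property is that everything between the scan and the buffer append happens inside one method body, so the whole stage runs without crossing operator boundaries per row.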
So what does the generated code look like? Let us keep analyzing the SQL from the previous articles. The physical plan generated for that SQL is as follows:
(Code: the physical plan of the example SQL)
As the physical plan above shows, the execution of the whole SQL is divided into three stages. For simplicity, we only analyze the code generation of the first stage, that is, the following physical plan:
(Code: the physical plan of the first stage)
With whole-stage code generation, the code produced for the process above is as follows:
(Code: the whole-stage generated code for the first stage)
The logic of the code above is easy to follow, and I have commented most of it: for each row it evaluates the expression isnotnull(id#8) && (id#8 > 5) on the id column and keeps the rows that satisfy it. The code generated for the remaining stages is similar; it is a bit long, so I will not paste it here, but interested readers can look it up themselves. Compared with the Volcano Iterator Model, the execution process with whole-stage code generation looks like this:
(Figure: execution with whole-stage code generation vs the Volcano Iterator Model)
By introducing whole-stage code generation, the number of virtual function calls is greatly reduced, CPU overhead drops, and SQL execution speed improves significantly.
Code compilation
Once the code has been generated, the next problem to solve is how to compile it and load it into the same JVM. Early versions of Spark used Scala's Reflection and Quasiquotes mechanisms for code generation. Quasiquotes are a neat notation that lets us easily manipulate Scala syntax trees; see here for details. Although Quasiquotes solved the code generation problem nicely, they introduced a new one: compiling the code took rather long (roughly 50 ms to 500 ms), so the community had to disable expression code generation by default.
To solve this problem, Spark introduced the Janino project (see SPARK-7956). Janino is a small but super fast Java compiler. It can not only compile a set of source files into bytecode files like the javac tool, but can also compile Java expressions, blocks, class bodies, or in-memory source files, and load the compiled bytecode directly into the same running JVM. Janino is not a development tool but an embedded runtime compiler, useful for things like expression evaluators or server-side JSP engines; for more about Janino, see here. By using Janino to compile the generated code, the compile time of a SQL expression dropped to about 5 ms. Spark uses the ClassBodyEvaluator to compile the generated code; see org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.
One thing to note: code generation happens on the Driver side, while code compilation happens on the Executor side.
SQL execution
Finally we come to where the SQL is actually executed. At this point, Spark executes the code generated in the previous stage and obtains the final result. The DAG of the execution is as follows:
(Figure: the DAG of the SQL execution)
Reprinted from 过往记忆 (https://www.iteblog.com/)
Original link: [A SQL's Journey in Apache Spark (Part 3)](https://www.iteblog.com/archives/2563.html)