A SQL Query's Journey Through Apache Spark (Part 3)

We have finally reached the last article of this series. In the previous two articles, "A SQL Query's Journey Through Apache Spark (Part 1)" and "A SQL Query's Journey Through Apache Spark (Part 2)", we covered Spark SQL's parsing, logical plan binding and optimization, and physical plan generation phases. In this article we pick up where we left off and introduce Spark SQL's whole-stage code generation phase and the final execution.


Whole-stage code generation - WholeStageCodegen

In the previous article we described how the physical plan (Physical Plan) is generated from the logical plan, but that plan is not executed by Spark directly. Spark first applies a number of rules to the SparkPlan in a step called prepareForExecution; these rules are as follows:

protected def preparations: Seq[Rule[SparkPlan]] = Seq(
   PlanSubqueries(sparkSession),                          // handle physical plans of special subqueries
   EnsureRequirements(sparkSession.sessionState.conf),    // ensure the plan's partitioning and ordering are correct
   CollapseCodegenStages(sparkSession.sessionState.conf), // code generation
   ReuseExchange(sparkSession.sessionState.conf),         // reuse exchange nodes
   ReuseSubquery(sparkSession.sessionState.conf))         // reuse subqueries

Among these rules, CollapseCodegenStages is the protagonist: it is the well-known whole-stage code generation, and it is the entry point of Catalyst's whole-stage codegen. Of course, for Spark to perform whole-stage code generation, spark.sql.codegen.wholeStage must be set to true (which is the default).
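Conceptually, prepareForExecution simply applies these rules to the SparkPlan one after another. Below is a minimal sketch of that step, not the exact source (the real method lives in org.apache.spark.sql.execution.QueryExecution; the spark variable is assumed to be a SparkSession):

// Minimal sketch: fold the preparation rules over the physical plan, in order.
def prepareForExecution(plan: SparkPlan): SparkPlan =
  preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }

// Whole-stage code generation can be toggled per session (true is the default):
spark.conf.set("spark.sql.codegen.wholeStage", "true")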

Why code generation is needed

Before introducing code generation, let's first understand why Spark SQL needed to introduce it at all. Prior to Apache Spark 2.0, the underlying implementation of Spark SQL was based on the Volcano Iterator Model (see "Volcano - An Extensible and Parallel Query Evaluation System"), proposed by Goetz Graefe in 1993; the vast majority of today's database systems still process SQL on top of this model. The model can be summarized as follows: first, the database engine translates the SQL into a series of relational algebra operators or expressions, and then these operators process the input data one by one and produce the result. Each operator implements the same interface, for example a next() method: the top-level operator's next() calls its child operator's next(), which in turn calls its own child's next(), and so on down to the leaf operator's next(). The following figure shows the process:

[Figure: the Volcano Iterator Model - each operator's next() calls its child operator's next()]
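To make the model concrete, here is a tiny illustrative sketch of the Volcano style in Scala (these are made-up classes, not Spark's): every operator exposes the same next() interface, and each call ripples down the operator tree one row at a time.

// Illustrative sketch of the Volcano Iterator Model, not Spark code.
trait VolcanoOperator {
  def next(): Option[Int]                    // return the next row, or None when exhausted
}

class Scan(data: Iterator[Int]) extends VolcanoOperator {
  def next(): Option[Int] = if (data.hasNext) Some(data.next()) else None
}

class Filter(child: VolcanoOperator, p: Int => Boolean) extends VolcanoOperator {
  def next(): Option[Int] = {
    var row = child.next()                   // one virtual call per row, per operator
    while (row.isDefined && !p(row.get)) row = child.next()
    row
  }
}

val plan = new Filter(new Scan(Iterator(1, 7, 3, 9)), _ > 5)
Iterator.continually(plan.next()).takeWhile(_.isDefined).foreach(r => println(r.get))
// prints 7 and 9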

The advantage of the Volcano Iterator Model is that the abstraction is very simple and easy to implement, and arbitrarily complex queries can be expressed by combining operators. The disadvantages are just as obvious: the large number of virtual function calls interrupts the CPU's pipeline and ultimately hurts execution efficiency. Databricks' official blog compared the efficiency of the Volcano Iterator Model against hand-written code, and found that the hand-written code ran more than ten times faster!
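By contrast, the "hand-written" code that the comparison refers to fuses the whole pipeline into one tight loop with no per-row virtual calls; an illustrative sketch:

// Illustrative sketch: scan + filter fused by hand into a single loop, no operator objects.
val data = Array(1, 7, 3, 9)
var i = 0
while (i < data.length) {
  val id = data(i)
  if (id > 5) println(id)   // the filter is inlined directly into the scan loop
  i += 1
}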

Based on these findings, starting from Apache Spark 2.0 the community introduced Whole-stage Code Generation (see SPARK-12795), whose main idea is to mimic hand-written code and thereby improve the efficiency of Spark SQL. Whole-stage Code Generation comes from the paper "Efficiently Compiling Efficient Query Plans for Modern Hardware" published by Thomas Neumann in 2011, and it is also part of the Tungsten project.

Tungsten's code generation is divided into three parts:

  • Expression code generation (expression codegen)
  • Whole-stage code generation (Whole-stage Code Generation)
  • Speeding up serialization and deserialization (speed up serialization/deserialization)

Expression code generation (expression codegen)

Expression code generation actually already existed in Spark 1.x. The base class for expression code generation is org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator, which has seven subclasses:

[Figure: the seven subclasses of CodeGenerator]

The filter condition isnotnull(id#8) && (id#8 > 5) produced in the logical plan of our earlier SQL is the most basic kind of expression. It is also a Predicate, so org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate is called to generate its code; the generated code is as follows:

19/06/18 16:47:15 DEBUG GeneratePredicate: Generated predicate '(isnotnull(input[0, int, true]) && (input[0, int, true] > 5))':

/* 001 */ public SpecificPredicate generate(Object[] references) {

/* 002 */   return new SpecificPredicate(references);

/* 003 */ }

/* 004 */

/* 005 */ class SpecificPredicate extends org.apache.spark.sql.catalyst.expressions.codegen.Predicate {

/* 006 */   private final Object[] references;

/* 007 */

/* 008 */

/* 009 */   public SpecificPredicate(Object[] references) {

/* 010 */     this.references = references;

/* 011 */

/* 012 */   }

/* 013 */

/* 014 */   public void initialize(int partitionIndex) {

/* 015 */

/* 016 */   }

/* 017 */

/* 018 */   public boolean eval(InternalRow i) {

/* 019 */     boolean isNull_2 = i.isNullAt(0);         // check whether id is null

/* 020 */     int value_2 = isNull_2 ?

/* 021 */     -1 : (i.getInt(0));

/* 022 */     boolean isNull_0 = false;

/* 023 */     boolean value_0 = false;

/* 024 */

/* 025 */     if (!false && !(!isNull_2)) {             // if id is null, the whole expression is false

/* 026 */     } else {

/* 027 */       boolean isNull_3 = true;

/* 028 */       boolean value_3 = false;

/* 029 */       boolean isNull_4 = i.isNullAt(0);       // check again whether id is null

/* 030 */       int value_4 = isNull_4 ?                // -1 if id is null, otherwise the value of id

/* 031 */       -1 : (i.getInt(0));

/* 032 */       if (!isNull_4) {                    // if id is not null, check whether the value is greater than 5

/* 033 */

/* 034 */

/* 035 */         isNull_3 = false; // resultCode could change nullability.

/* 036 */         value_3 = value_4 > 5;

/* 037 */

/* 038 */       }

/* 039 */       if (!isNull_3 && !value_3) {

/* 040 */       } else if (!false && !isNull_3) {      // id is greater than 5

/* 041 */         value_0 = true;

/* 042 */       } else {

/* 043 */         isNull_0 = true;

/* 044 */       }

/* 045 */     }

/* 046 */     return !isNull_0 && value_0;   // the result of evaluating isnotnull(id#8) && (id#8 > 5) for this row

/* 047 */   }

/* 048 */

/* 049 */

/* 050 */ }

The code above is what gets generated for the expression isnotnull(id#8) && (id#8 > 5). It uses the code generation of three Predicates, org.apache.spark.sql.catalyst.expressions.And, org.apache.spark.sql.catalyst.expressions.IsNotNull and org.apache.spark.sql.catalyst.expressions.GreaterThan, which together make up the SpecificPredicate above. SpecificPredicate applies its eval function to every row to decide whether the row satisfies the condition. The logic of the generated SpecificPredicate class is not complicated and is worth studying in detail.

Expression code generation mainly aims to eliminate the cost of a large number of virtual function calls (Virtual Function Calls) and the cost of generalization. Note that a complete standalone class like the one above is generated for an expression only when spark.sql.codegen.wholeStage is set to false; otherwise only a code fragment is generated, which is then combined with other fragments into the whole-stage code.
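If you want to reproduce this code generation path yourself, the following is a minimal sketch against the Spark 2.x Catalyst API (the attribute id below is a hand-built stand-in for id#8; enabling DEBUG logging for the codegen package prints the generated source shown above):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, GreaterThan, IsNotNull, Literal}
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate
import org.apache.spark.sql.types.IntegerType

// Build the expression isnotnull(id) && (id > 5) by hand.
val id = AttributeReference("id", IntegerType, nullable = true)()
val condition = And(IsNotNull(id), GreaterThan(id, Literal(5)))

// Bind the expression to the input schema and generate a compiled Predicate.
val predicate = GeneratePredicate.generate(condition, Seq(id))
predicate.initialize(0)
println(predicate.eval(InternalRow(7)))  // true
println(predicate.eval(InternalRow(3)))  // false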

Whole-stage code generation (Whole-stage Code Generation)

Whole-stage code generation (Whole-stage Code Generation) combines multiple processing steps into a single generated code module, and it also makes use of the expression code generation described above. Unlike expression code generation, which only covers individual expressions, whole-stage code generation produces code for an entire stage of the SQL execution. All the generated classes extend org.apache.spark.sql.execution.BufferedRowIterator, and the generated code must implement the processNext() method, which is called from the doExecute method of org.apache.spark.sql.execution.WholeStageCodegenExec. The RDD in that method feeds the data into the generated code. For example, the data source of the SQL in our earlier example is a CSV file; org.apache.spark.sql.execution.FileSourceScanExec reads the file and produces the inputRDD, this RDD invokes the generated code inside the doExecute method of WholeStageCodegenExec, and the generated code then performs all our checks and produces the final result. Part of the doExecute method of WholeStageCodegenExec looks like this:

// rdds can be obtained from the inputRDDs method of FileSourceScanExec

val rdds = child.asInstanceOf[CodegenSupport].inputRDDs()

 

......

 

rdds.head.mapPartitionsWithIndex { (index, iter) =>

    // compile the generated code

    val (clazz, _) = CodeGenerator.compile(cleanedSource)

    // as mentioned above, all the generated code extends BufferedRowIterator

    val buffer = clazz.generate(references).asInstanceOf[BufferedRowIterator]

    // call init on the generated code, passing in the iter iterator that carries our data

    buffer.init(index, Array(iter))

    new Iterator[InternalRow] {

      override def hasNext: Boolean = {

        // this calls the processNext() method of the generated code, which evaluates the expressions against each row

        val v = buffer.hasNext

        if (!v) durationMs += buffer.durationMs()

        v

      }

      override def next: InternalRow = buffer.next()

    }

 

......

So what does the generated code look like? Let's again analyze the SQL from the previous articles; the physical plan generated for it is as follows:

== Physical Plan ==

*(3) HashAggregate(keys=[], functions=[sum(cast(v#16 as bigint))], output=[sum(v)#22L])

+- Exchange SinglePartition

   +- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(v#16 as bigint))], output=[sum#24L])

      +- *(2) Project [(3 + value#1) AS v#16]

         +- *(2) BroadcastHashJoin [id#0], [id#8], Inner, BuildRight

            :- *(2) Project [id#0, value#1]

            :  +- *(2) Filter (((((isnotnull(cid#2) && isnotnull(did#3)) && (cid#2 = 1)) && (did#3 = 2)) && (id#0 > 5)) && isnotnull(id#0))

            :     +- *(2) FileScan csv [id#0,value#1,cid#2,did#3] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/iteblog/t1.csv], PartitionFilters: [], PushedFilters: [IsNotNull(cid), IsNotNull(did), EqualTo(cid,1), EqualTo(did,2), GreaterThan(id,5), IsNotNull(id)], ReadSchema: struct<id:int,value:int,cid:int,did:int>

            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))

               +- *(1) Project [id#8]

                  +- *(1) Filter (isnotnull(id#8) && (id#8 > 5))

                     +- *(1) FileScan csv [id#8] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/iteblog/t2.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,5)], ReadSchema: struct<id:int>

From the physical plan above, the execution of the whole SQL is divided into three stages (the *(n) prefix in front of an operator indicates that the operator participates in whole-stage code generation, and n is the codegen stage id). For simplicity, we only analyze the code generation of the first stage, namely the following part of the physical plan:

+- *(1) Project [id#8]

   +- *(1) Filter (isnotnull(id#8) && (id#8 > 5))

      +- *(1) FileScan csv [id#8] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/iteblog/t2.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,5)], ReadSchema: struct<id:int>

With whole-stage code generation, the code produced for the plan above is as follows:

Generated code:

/* 001 */ public Object generate(Object[] references) {

/* 002 */   return new GeneratedIteratorForCodegenStage1(references);

/* 003 */ }

/* 004 */

/* 005 */ // codegenStageId=1

/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {

/* 007 */   private Object[] references;

/* 008 */   private scala.collection.Iterator[] inputs;

/* 009 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];

/* 010 */   private scala.collection.Iterator[] scan_mutableStateArray_0 = new scala.collection.Iterator[1];

/* 011 */

/* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {

/* 013 */     this.references = references;

/* 014 */   }

/* 015 */

/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {   // called from doExecute in WholeStageCodegenExec

/* 017 */     partitionIndex = index;

/* 018 */     this.inputs = inputs;

/* 019 */     scan_mutableStateArray_0[0] = inputs[0];

/* 020 */     filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);

/* 021 */     filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);

/* 022 */

/* 023 */   }

/* 024 */

/* 025 */   protected void processNext() throws java.io.IOException {  // process each row; this is where isnotnull(id#8) && (id#8 > 5) is evaluated

/* 026 */     while (scan_mutableStateArray_0[0].hasNext()) {

/* 027 */       InternalRow scan_row_0 = (InternalRow) scan_mutableStateArray_0[0].next();

/* 028 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);

/* 029 */       do {

/* 030 */         boolean scan_isNull_0 = scan_row_0.isNullAt(0);     // check whether id is null

/* 031 */         int scan_value_0 = scan_isNull_0 ?                  // scan_value_0 is -1 if id is null, otherwise the value of id

/* 032 */         -1 : (scan_row_0.getInt(0));

/* 033 */

/* 034 */         if (!(!scan_isNull_0)) continue;                   // if id is null, drop this row

/* 035 */

/* 036 */         boolean filter_value_2 = false;

/* 037 */         filter_value_2 = scan_value_0 > 5;                 // is id greater than 5

/* 038 */         if (!filter_value_2) continue;                     // if id is not greater than 5, drop this row

/* 039 */

/* 040 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);

/* 041 */

/* 042 */         filter_mutableStateArray_0[1].reset();

/* 043 */

/* 044 */         if (false) {

/* 045 */           filter_mutableStateArray_0[1].setNullAt(0);

/* 046 */         } else {

/* 047 */           filter_mutableStateArray_0[1].write(0, scan_value_0);  // this is an id that satisfies isnotnull(id#8) && (id#8 > 5)

/* 048 */         }

/* 049 */         append((filter_mutableStateArray_0[1].getRow()));        // emit the row that satisfies the condition

/* 050 */

/* 051 */       } while(false);

/* 052 */       if (shouldStop()) return;

/* 053 */     }

/* 054 */   }

/* 055 */

/* 056 */ }

The logic of the code above is easy to follow, and most of it is annotated: it simply evaluates the expression isnotnull(id#8) && (id#8 > 5) against the id of each row and keeps the rows that satisfy the condition. The code generated for the remaining stages is similar; it is rather long, so I will not paste it here, but interested readers can inspect it themselves. Compared with the Volcano Iterator Model, the execution process under whole-stage code generation looks like this:

[Figure: the execution process under whole-stage code generation, compared with the Volcano Iterator Model]

By introducing whole-stage code generation, the number of virtual function calls is greatly reduced, CPU overhead goes down, and SQL execution speed improves significantly.
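If you want to inspect the code generated for every stage of your own query, a minimal sketch using the debug helper available in Spark 2.x (the SQL text is whatever query you are analyzing):

val df = spark.sql("...")            // the SQL from the previous articles
df.queryExecution.debug.codegen()    // prints the generated code of each whole-stage codegen subtree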

Code compilation

After the code is generated, the next problem is how to compile it and load it into the same JVM. Early Spark versions used Scala's Reflection and Quasiquotes mechanisms to implement code generation. Quasiquotes is a concise notation that makes it easy to manipulate Scala syntax trees (see here for details). Although Quasiquotes solve code generation well, they bring a new problem: compiling the code takes quite a long time (roughly 50ms to 500ms), so the community had to disable expression code generation by default.
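For reference, a tiny sketch of what Quasiquotes look like (it requires the scala-reflect library on the classpath; the expressions below are made up for illustration):

import scala.reflect.runtime.universe._

// q"..." builds a Scala syntax tree instead of a plain string, and trees compose with $-interpolation.
val gt   = q"input.getInt(0) > 5"
val expr = q"!input.isNullAt(0) && $gt"
println(showCode(expr))   // prints the composed expression as source code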

To solve this problem, Spark introduced the Janino project (see SPARK-7956). Janino is a small but extremely fast Java compiler. It can not only compile a set of source files into bytecode like the javac tool, but can also compile a Java expression, a block, a class body or an in-memory source file, and load the resulting bytecode directly into the same running JVM. Janino is not a development tool; it is meant to be used as an embedded runtime compiler, for example in expression evaluators or server-side JSP engines (see here for more about Janino). By using Janino to compile the generated code, the time to compile a SQL expression drops to about 5ms. Spark uses Janino's ClassBodyEvaluator to compile the generated code; see org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.
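A minimal sketch of what ClassBodyEvaluator does (a toy example, not Spark's actual call site in CodeGenerator):

import org.codehaus.janino.ClassBodyEvaluator

// Compile a Java class body at runtime and load it into the current JVM.
val evaluator = new ClassBodyEvaluator()
evaluator.setClassName("demo.GeneratedPredicate")
evaluator.setImplementedInterfaces(Array(classOf[java.util.function.IntPredicate]))
evaluator.cook("@Override public boolean test(int id) { return id > 5; }")
val p = evaluator.getClazz.getDeclaredConstructor().newInstance()
  .asInstanceOf[java.util.function.IntPredicate]
println(p.test(7))  // true
println(p.test(3))  // false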

One thing to note is that code generation happens on the Driver side, while the generated code is compiled on the Executor side.

SQL execution

We finally arrive at the actual execution of the SQL. At this point Spark runs the code generated in the previous stage and obtains the final result. The DAG of the execution is as follows:

[Figure: the DAG of the final execution]

Reprinted from 过往记忆 (https://www.iteblog.com/)
Original link: A SQL Query's Journey Through Apache Spark (Part 3) (https://www.iteblog.com/archives/2563.html)

 
