Spark Source Code: Expression Simplification in Logical Plan Optimization

Copyright notice: https://github.com/wusuopubupt https://blog.csdn.net/wusuopuBUPT/article/details/76162495


1. Constant Folding

Replaces expressions that can be computed statically.

Example SQL:


select 1+2+3 from t1


Optimization process:


scala> sqlContext.sql( "select 1+2+3 from t1" )
17/07/25 16:50:21 INFO parse.ParseDriver: Parsing command:  select  1+2+3  from  t1
17/07/25 16:50:21 INFO parse.ParseDriver: Parse Completed
res27: org.apache.spark.sql.DataFrame = [_c0:  int ]
 
scala> res27.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(((1 + 2) + 3))]
+- ' UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0:  int
Project [((1 + 2) + 3)  AS  _c0#19]
+- Subquery t1
    +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
       +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
Project [6  AS  _c0#19]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Physical Plan ==
Project [6  AS  _c0#19]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

As shown, after optimization the expression in the Project becomes the literal 6 (the result of 1+2+3), and the physical plan simply returns 6.

The implementation is as follows:

/**
 * Replaces expressions that can be statically evaluated.
 */
object ConstantFolding extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown { // transform the plan's expressions
      // Skip literals to avoid re-evaluating them (a Literal is itself foldable).
      case l: Literal => l
      // Evaluate foldable expressions eagerly and replace them with literals.
      case e if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType)
    }
  }
}
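
The core idea (fold any subtree whose inputs are all literals, bottom-up) can be sketched on a toy expression tree. This is a standalone illustration, not Catalyst's real classes; `Expr`, `Lit`, `Add`, and `Col` are made-up names:

```scala
// Toy expression tree illustrating constant folding (standalone, not Catalyst).
sealed trait Expr { def foldable: Boolean }
case class Lit(value: Int) extends Expr { val foldable = true }
case class Col(name: String) extends Expr { val foldable = false }
case class Add(l: Expr, r: Expr) extends Expr { def foldable = l.foldable && r.foldable }

def eval(e: Expr): Int = e match {
  case Lit(v)    => v
  case Add(l, r) => eval(l) + eval(r)
  case Col(_)    => sys.error("cannot evaluate a column statically")
}

// Fold any subtree whose inputs are all literals into a single literal.
def constantFold(e: Expr): Expr = e match {
  case l: Lit          => l                // already a literal, skip
  case _ if e.foldable => Lit(eval(e))     // statically computable subtree
  case Add(l, r)       => Add(constantFold(l), constantFold(r))
  case other           => other
}

// (1 + 2) + 3  =>  6;  col + (1 + 2)  =>  col + 3
println(constantFold(Add(Add(Lit(1), Lit(2)), Lit(3))))   // Lit(6)
println(constantFold(Add(Col("x"), Add(Lit(1), Lit(2))))) // Add(Col(x),Lit(3))
```

Note how the literal case comes first, mirroring the real rule: without it, every `Lit` would be needlessly re-evaluated on every pass.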


2. Simplify Filters

 If a filter always evaluates to true, remove it (e.g. where 2 > 1).
 If a filter always evaluates to false, replace the plan with an empty result (e.g. where 2 < 1).

Example SQL:

select name from t1 where 2 > 1
Optimization process:
scala> sqlContext.sql( "select name from t1 where 2 > 1" )
17/07/25 15:50:25 INFO parse.ParseDriver: Parsing command:  select  name  from  t1  where  2 > 1
17/07/25 15:50:25 INFO parse.ParseDriver: Parse Completed
res23: org.apache.spark.sql.DataFrame = [ name : string]
 
scala> res23.queryExecution
res24: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(' name )]
+-  'Filter (2 > 1)
    +- ' UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name : string
Project [ name #5]
+- Filter (2 > 1)
    +- Subquery t1
       +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
          +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
Project [_1#0  AS  name #5]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Physical Plan ==
Project [_1#0  AS  name #5]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

As shown, after optimization the always-true filter 2 > 1 has been removed from the logical plan.

The implementation is as follows:

object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // If the filter condition always evaluates to true, remove the filter.
    case Filter(Literal(true, BooleanType), child) => child
    // If the filter condition always evaluates to null or false,
    // replace the input with an empty relation.
    case Filter(Literal(null, _), child) => LocalRelation(child.output, data = Seq.empty)
    case Filter(Literal(false, BooleanType), child) => LocalRelation(child.output, data = Seq.empty)
  }
}
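
The same three cases can be sketched on a toy plan tree (standalone code with made-up node names, not Catalyst's): a statically-known true condition drops the Filter node, a statically-known false one replaces the subtree with an empty relation, and anything else is left alone:

```scala
// Toy logical plan illustrating filter simplification (standalone, not Catalyst).
sealed trait Plan
case class Relation(name: String) extends Plan
// condition: Some(true/false) when statically known, None otherwise
case class Filter(condition: Option[Boolean], child: Plan) extends Plan
case class EmptyRelation(source: Plan) extends Plan // stands in for LocalRelation with no rows

def simplifyFilters(plan: Plan): Plan = plan match {
  case Filter(Some(true), child)  => simplifyFilters(child)               // always true: drop the filter
  case Filter(Some(false), child) => EmptyRelation(simplifyFilters(child)) // always false: empty result
  case Filter(cond, child)        => Filter(cond, simplifyFilters(child))  // unknown: keep it
  case other                      => other
}

println(simplifyFilters(Filter(Some(true), Relation("t1"))))  // Relation(t1)
println(simplifyFilters(Filter(Some(false), Relation("t1")))) // EmptyRelation(Relation(t1))
```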

3. Simplify Casts

If the expression's data type already matches the cast's target type, remove the Cast.

Example SQL:

select cast(name as String) from t1
Optimization process:
// name is already of type String
scala> sqlContext.sql( "select cast(name as String) from t1" )
17/07/25 16:59:44 INFO parse.ParseDriver: Parsing command:  select  cast ( name  as  String)  from  t1
17/07/25 16:59:44 INFO parse.ParseDriver: Parse Completed
res29: org.apache.spark.sql.DataFrame = [ name : string]
 
scala> res29.queryExecution
res30: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(cast(' name  as  string))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name : string
Project [ cast ( name #5  as  string)  AS  name #20]
+- Subquery t1
    +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
       +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
// the redundant cast has been removed
Project [_1#0  AS  name #20]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Physical Plan ==
Project [_1#0  AS  name #20]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

Since name is already a String, the optimizer removed the cast-to-String expression.

The implementation is as follows:

object SimplifyCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(e, dataType) if e.dataType == dataType => e
  }
}
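
A minimal standalone sketch of the same check, on toy expression and type classes (not Catalyst's; `DType`, `Col`, and `Cast` here are made-up names):

```scala
// Toy types and expressions illustrating cast elimination (standalone, not Catalyst).
sealed trait DType
case object StringT extends DType
case object IntT extends DType

sealed trait Expr { def dataType: DType }
case class Col(name: String, dataType: DType) extends Expr
case class Cast(child: Expr, dataType: DType) extends Expr

// Remove a cast whose target type equals the child's type; keep it otherwise.
def simplifyCasts(e: Expr): Expr = e match {
  case Cast(child, t) if child.dataType == t => simplifyCasts(child)
  case Cast(child, t)                        => Cast(simplifyCasts(child), t)
  case other                                 => other
}

println(simplifyCasts(Cast(Col("name", StringT), StringT))) // Col(name,StringT): cast dropped
println(simplifyCasts(Cast(Col("id", IntT), StringT)))      // cast kept: types differ
```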

4. Simplify Case Conversion Expressions

For nested case-conversion expressions, only the outermost conversion matters; the inner ones are removed.

Example SQL:

select upper(lower(name)) from t1
Optimization process:
scala> sqlContext.sql( "select upper(lower(name)) from t1" )
17/07/25 17:13:01 INFO parse.ParseDriver: Parsing command:  select  upper ( lower ( name ))  from  t1
17/07/25 17:13:01 INFO parse.ParseDriver: Parse Completed
res34: org.apache.spark.sql.DataFrame = [_c0: string]
 
scala> res34.queryExecution
res35: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(' upper ( 'lower(' name )))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0: string
Project [ upper ( lower ( name #5))  AS  _c0#22]
+- Subquery t1
    +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
       +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
// only the outermost upper() remains
Project [ upper (_1#0)  AS  _c0#22]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Physical Plan ==
Project [ upper (_1#0)  AS  _c0#22]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]

After optimization only the outermost case conversion remains, which is equivalent to executing: select upper(name) from t1

The implementation is as follows:

object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // the outermost conversion wins; the inner one is dropped
      case Upper(Upper(child)) => Upper(child)
      case Upper(Lower(child)) => Upper(child)
      case Lower(Upper(child)) => Lower(child)
      case Lower(Lower(child)) => Lower(child)
    }
  }
}
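
Because the rule works bottom-up, arbitrarily deep nesting collapses one layer at a time. A standalone sketch of the same idea (toy node names, not Catalyst's):

```scala
// Toy string expressions illustrating case-conversion simplification (standalone, not Catalyst).
sealed trait StrExpr
case class Attr(name: String) extends StrExpr
case class Upper(child: StrExpr) extends StrExpr
case class Lower(child: StrExpr) extends StrExpr

// Strip every inner conversion; only the outermost one determines the result.
def stripConversions(e: StrExpr): StrExpr = e match {
  case Upper(c) => stripConversions(c)
  case Lower(c) => stripConversions(c)
  case other    => other
}

def simplify(e: StrExpr): StrExpr = e match {
  case Upper(c) => Upper(stripConversions(c))
  case Lower(c) => Lower(stripConversions(c))
  case other    => other
}

println(simplify(Upper(Lower(Attr("name")))))        // Upper(Attr(name))
println(simplify(Lower(Upper(Upper(Attr("name")))))) // Lower(Attr(name))
```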

5. Optimize In

Rewrites an In over a list of literals into an In over a set (InSet).

Example SQL:

select * from t1 where id in (1,1,2,2,1,2,1,2,2,2,2,2)
After optimization this is equivalent to executing (note: in a Spark 1.6.2 test environment the optimization was not observed!):
select * from t1 where id in (1,2)

The implementation is as follows:

/**
 * Replaces [[In (value, seq[Literal])]] with the optimized version
 * [[InSet (value, HashSet[Literal])]], which is much faster.
 */
object OptimizeIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) && list.size > 10 =>
        val hSet = list.map(e => e.eval(EmptyRow))
        InSet(v, HashSet() ++ hSet)
    }
  }
}
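
What the rewrite buys can be seen with plain Scala collections, outside Spark entirely: membership in a List is a linear scan per row, while a hash set answers in (amortized) constant time and deduplicates the values as a side effect:

```scala
// Plain-Scala illustration of why InSet beats In (not Spark code).
val inList = List(1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2) // the 12 literals from the SQL above
val inSet  = inList.toSet                             // hash-based, deduplicated to Set(1, 2)

println(inSet) // Set(1, 2)

// Both predicates select exactly the same ids; the set just answers in O(1) per lookup
// instead of scanning all 12 list entries.
val ids = Seq(0, 1, 2, 3, 1, 2)
println(ids.filter(inList.contains)) // List(1, 2, 1, 2)
println(ids.filter(inSet.contains))  // List(1, 2, 1, 2)
```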

6. Like Simplification

The following Like pattern shapes are optimized so they do not need regular-expression matching:

startsWith:    'abc%'
endsWith:     '%abc'
contains:      '%abc%'
equalTo:       'abc'

Example SQL:

select name from t1 where name like 'Bo%'
This is not executed as a regular-expression match. Optimization process:
scala> sqlContext.sql( "select name from t1 where name like 'B%'" )
17/07/25 18:25:04 INFO parse.ParseDriver: Parsing command:  select  name  from  t1  where  name  like  'B%'
17/07/25 18:25:04 INFO parse.ParseDriver: Parse Completed
res46: org.apache.spark.sql.DataFrame = [ name : string]
 
scala> res46.queryExecution
res47: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(' name )]
+-  'Filter ' name  LIKE  B%
    +- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name : string
Project [ name #5]
+- Filter  name #5  LIKE  B%
    +- Subquery t1
       +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
          +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
Project [_1#0  AS  name #5]
+- Filter StartsWith(_1#0, B) // optimized to a string startsWith()
    +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Physical Plan ==
Project [_1#0  AS  name #5]
+- Filter StartsWith(_1#0, B)
    +- Scan ExistingRDD[_1#0,_2#1,_3#2,_...

After optimization the original Like pattern is converted into a string startsWith() operation.

The implementation is as follows:

/**
 * Simplifies Like expressions that do not need full regular-expression matching.
 */
object LikeSimplification extends Rule[LogicalPlan] {
  // The if guards below protect from escapes on trailing %.
  // Cases like "something\%" are not optimized, but this does not affect correctness.
  private val startsWith = "([^_%]+)%".r  // 'abc%'
  private val endsWith = "%([^_%]+)".r    // '%abc'
  private val contains = "%([^_%]+)%".r   // '%abc%'
  private val equalTo = "([^_%]*)".r      // 'abc'

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Like(l, Literal(utf, StringType)) =>
      utf.toString match {
        case startsWith(pattern) if !pattern.endsWith("\\") =>
          StartsWith(l, Literal(pattern))  // string startsWith()
        case endsWith(pattern) =>
          EndsWith(l, Literal(pattern))    // string endsWith()
        case contains(pattern) if !pattern.endsWith("\\") =>
          Contains(l, Literal(pattern))    // substring containment check
        case equalTo(pattern) =>
          EqualTo(l, Literal(pattern))     // string equality
        case _ =>
          Like(l, Literal.create(utf, StringType))
      }
  }
}
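
The four regexes can be exercised outside Spark to see how a Like pattern is classified. In a Scala pattern match, a `Regex` extractor only succeeds on a full-string match, which is exactly what the rule relies on (`classify` is our own helper name for illustration):

```scala
// Standalone classification of Like patterns using the same four regexes (not Spark code).
val startsWith = "([^_%]+)%".r  // 'abc%'
val endsWith   = "%([^_%]+)".r  // '%abc'
val contains   = "%([^_%]+)%".r // '%abc%'
val equalTo    = "([^_%]*)".r   // 'abc'

def classify(pattern: String): String = pattern match {
  case startsWith(p) => s"StartsWith($p)"
  case endsWith(p)   => s"EndsWith($p)"
  case contains(p)   => s"Contains($p)"
  case equalTo(p)    => s"EqualTo($p)"
  case _             => "Like (full regex matching)"
}

println(classify("B%"))    // StartsWith(B)
println(classify("%abc"))  // EndsWith(abc)
println(classify("%abc%")) // Contains(abc)
println(classify("abc"))   // EqualTo(abc)
println(classify("a%b"))   // Like (full regex matching): % in the middle, no shortcut
```

Note that the character class `[^_%]` excludes both SQL wildcards, so a pattern like `a%b` falls through every case and keeps the general Like path.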


7. Null Propagation

In certain cases, replaces null expressions with literals to stop nulls from propagating through the plan.

Example SQL:

select count(null) from t1
Optimization process:
scala> sqlContext.sql( "select count(null) from t1" )
17/07/26 11:40:18 INFO parse.ParseDriver: Parsing command:  select  count ( null from  t1
17/07/26 11:40:18 INFO parse.ParseDriver: Parse Completed
res8: org.apache.spark.sql.DataFrame = [_c0:  bigint ]
 
scala> res8.queryExecution
res10: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(' count ( null ))]
+- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
_c0:  bigint
Aggregate [( count ( null ),mode=Complete,isDistinct= false AS  _c0#10L]
+- Subquery t1
    +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
       +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
// returns 0 directly
Aggregate [0  AS  _c0#10L]
+- Project
    +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27

After optimization count(null) is replaced by the literal 0, so no full table scan is needed to compute it.

The implementation is as follows:

object NullPropagation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // count(null) is replaced with the literal 0
      case e @ Count(Literal(null, _)) => Cast(Literal(0L), e.dataType)
      case e @ AggregateExpression(Count(exprs), _, _) if !exprs.exists(nonNullLiteral) =>
        Cast(Literal(0L), e.dataType)
      // a non-nullable expression can never be null
      case e @ IsNull(c) if !c.nullable => Literal.create(false, BooleanType)
      case e @ IsNotNull(c) if !c.nullable => Literal.create(true, BooleanType)
      // ... (remaining cases omitted)
    }
  }
}
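
Three of these cases can be sketched standalone on toy expression nodes (made-up names such as `CountOf` and `IsNullOf`, not Catalyst's classes):

```scala
// Toy expressions illustrating null propagation (standalone, not Catalyst).
sealed trait Expr
case class Attr(name: String, nullable: Boolean) extends Expr
case class Lit(value: Any) extends Expr
case class CountOf(child: Expr) extends Expr
case class IsNullOf(child: Expr) extends Expr
case class IsNotNullOf(child: Expr) extends Expr

def propagateNulls(e: Expr): Expr = e match {
  case CountOf(Lit(null))          => Lit(0L)    // count(null) is always 0
  case IsNullOf(Attr(_, false))    => Lit(false) // a non-nullable column is never null
  case IsNotNullOf(Attr(_, false)) => Lit(true)  // ... and is always not null
  case other                       => other      // nullable columns must stay as runtime checks
}

println(propagateNulls(CountOf(Lit(null))))            // Lit(0)
println(propagateNulls(IsNullOf(Attr("name", false)))) // Lit(false)
```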

8. Boolean Simplification

When boolean expressions are combined with logical connectives (and, or, not), they are simplified using the connectives' properties (e.g. true && a > 1 simplifies to a > 1, and true || a > 1 simplifies to true).

Example SQL:

select name from t1 where 2 > 1 and time > 1
Optimization process:
scala> sqlContext.sql( "select name from t1 where 2 > 1 and time > 1" )
17/07/26 12:10:17 INFO parse.ParseDriver: Parsing command:  select  name  from  t1  where  2 > 1  and  time  > 1
17/07/26 12:10:17 INFO parse.ParseDriver: Parse Completed
res26: org.apache.spark.sql.DataFrame = [ name : string]
 
scala> res26.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(' name )]
+-  'Filter ((2 > 1) && (' time  > 1))
    +- 'UnresolvedRelation `t1`, None
 
== Analyzed Logical Plan ==
name : string
Project [ name #5]
+- Filter ((2 > 1) && ( time #9 > 1))
    +- Subquery t1
       +- Project [_1#0  AS  name #5,_2#1  AS  date #6,_3#2  AS  cate#7,_4#3  AS  amountSpent#8,_5#4  AS  time #9]
          +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27
 
== Optimized Logical Plan ==
Project [_1#0  AS  name #5]
// 2 > 1 is always true, so it is dropped from the && condition
+- Filter (_5#4 > 1)
    +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1]  at  rddToDataFrameHolder  at  <console>:27

After optimization the always-true expression 2 > 1 has been removed from the and condition in the logical plan.

The implementation is as follows:

/**
 * Simplifies boolean expressions:
 * 1. Simplifies expressions whose answer can be determined without evaluating both sides.
 * 2. Eliminates / extracts common factors.
 * 3. Merges identical expressions.
 * 4. Removes the `Not` operator.
 */
object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // And: an always-true side can be dropped; an always-false side short-circuits to false
      case and @ And(left, right) => (left, right) match {
        // true && r  =>  r
        case (Literal(true, BooleanType), r) => r
        // l && true  =>  l
        case (l, Literal(true, BooleanType)) => l
        // false && r  =>  false
        case (Literal(false, BooleanType), _) => Literal(false)
        // l && false  =>  false
        case (_, Literal(false, BooleanType)) => Literal(false)
        // a && a  =>  a
        case (l, r) if l fastEquals r => l
        // a && (not(a) || b)  =>  a && b
        case (l, Or(l1, r)) if Not(l) == l1 => And(l, r)
        case (l, Or(r, l1)) if Not(l) == l1 => And(l, r)
        case (Or(l, l1), r) if l1 == Not(r) => And(l, r)
        case (Or(l1, l), r) if l1 == Not(r) => And(l, r)
        // (a || b) && (a || c)  =>  a || (b && c)
        // ... (remaining cases omitted)
        case _ => and
      } // end of And(left, right)

      // Or: short-circuit evaluation
      case or @ Or(left, right) => (left, right) match {
        // true || r  =>  true (one true side makes the whole expression true)
        case (Literal(true, BooleanType), _) => Literal(true)
        // r || true  =>  true
        case (_, Literal(true, BooleanType)) => Literal(true)
        // false || r  =>  r
        case (Literal(false, BooleanType), r) => r
        // l || false  =>  l
        case (l, Literal(false, BooleanType)) => l
        // a || a  =>  a
        case (l, r) if l fastEquals r => l
        // (a && b) || (a && c)  =>  a && (b || c)
        // ... (remaining cases omitted)
        case _ => or
      } // end of Or(left, right)

      // Eliminate Not by pushing it into the child expression
      case not @ Not(exp) => exp match {
        // not(true)  =>  false
        case Literal(true, BooleanType) => Literal(false)
        // not(false)  =>  true
        case Literal(false, BooleanType) => Literal(true)
        // not(l > r)  =>  l <= r
        case GreaterThan(l, r) => LessThanOrEqual(l, r)
        // not(l >= r)  =>  l < r
        case GreaterThanOrEqual(l, r) => LessThan(l, r)
        // not(l < r)  =>  l >= r
        case LessThan(l, r) => GreaterThanOrEqual(l, r)
        // not(l <= r)  =>  l > r
        case LessThanOrEqual(l, r) => GreaterThan(l, r)
        // not(l || r)  =>  not(l) && not(r)
        case Or(l, r) => And(Not(l), Not(r))
        // not(l && r)  =>  not(l) || not(r)
        case And(l, r) => Or(Not(l), Not(r))
        // not(not(e))  =>  e
        case Not(e) => e
        case _ => not
      } // end of Not(exp)

      // if (true) a else b  =>  a;  if (false) a else b  =>  b
      case e @ If(Literal(v, _), trueValue, falseValue) => if (v == true) trueValue else falseValue
    }
  }
}
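
The short-circuit and double-negation cases can be sketched standalone on a toy boolean AST (made-up node names, not Catalyst's; the common-factor extraction cases are omitted):

```scala
// Toy boolean AST illustrating the literal and Not cases (standalone, not Catalyst).
sealed trait BExpr
case class BLit(value: Boolean) extends BExpr
case class Pred(name: String) extends BExpr // an opaque predicate like "time > 1"
case class And(l: BExpr, r: BExpr) extends BExpr
case class Or(l: BExpr, r: BExpr) extends BExpr
case class Not(child: BExpr) extends BExpr

// Bottom-up: simplify children first, then apply the connective's identities.
def simplify(e: BExpr): BExpr = e match {
  case And(l, r) => (simplify(l), simplify(r)) match {
    case (BLit(true), x)  => x            // true && x  =>  x
    case (x, BLit(true))  => x
    case (BLit(false), _) => BLit(false)  // false && x  =>  false
    case (_, BLit(false)) => BLit(false)
    case (x, y) if x == y => x            // x && x  =>  x
    case (x, y)           => And(x, y)
  }
  case Or(l, r) => (simplify(l), simplify(r)) match {
    case (BLit(true), _)  => BLit(true)   // true || x  =>  true
    case (_, BLit(true))  => BLit(true)
    case (BLit(false), x) => x            // false || x  =>  x
    case (x, BLit(false)) => x
    case (x, y) if x == y => x            // x || x  =>  x
    case (x, y)           => Or(x, y)
  }
  case Not(c) => simplify(c) match {
    case BLit(b) => BLit(!b)              // not(true) => false, not(false) => true
    case Not(x)  => x                     // not(not(x)) => x
    case x       => Not(x)
  }
  case other => other
}

// true && (time > 1)  =>  time > 1  (modeling the always-true "2 > 1" as a literal)
println(simplify(And(BLit(true), Pred("time > 1")))) // Pred(time > 1)
println(simplify(Or(Pred("a"), BLit(true))))         // BLit(true)
println(simplify(Not(Not(Pred("a")))))               // Pred(a)
```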


Reposted from blog.csdn.net/wusuopuBUPT/article/details/76162495