1. Constant Folding
Replaces expressions that can be evaluated statically with their results.
Example SQL:
select 1 + 2 + 3 from t1
|
Optimization process:
scala> sqlContext.sql("select 1+2+3 from t1")
17/07/25 16:50:21 INFO parse.ParseDriver: Parsing command: select 1+2+3 from t1
17/07/25 16:50:21 INFO parse.ParseDriver: Parse Completed
res27: org.apache.spark.sql.DataFrame = [_c0: int]

scala> res27.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(((1 + 2) + 3))]
+- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
_c0: int
Project [((1 + 2) + 3) AS _c0#19]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [6 AS _c0#19]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [6 AS _c0#19]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
|
After optimization, the expression in the logical plan's Project has been folded into the literal 6 (the result of 1+2+3), and the physical plan simply returns 6.
The implementation is as follows:
/**
 * Replaces expressions that can be statically evaluated with literals.
 */
object ConstantFolding extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Transform the expressions of every plan node
    case q: LogicalPlan => q transformExpressionsDown {
      // Literals are returned as-is to avoid re-evaluating them (Literal is itself foldable)
      case l: Literal => l
      // Fold any foldable expression by calling eval() and returning the result as a literal
      case e if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType)
    }
  }
}
|
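To make the idea concrete outside of Catalyst, here is a minimal, self-contained Scala sketch of constant folding on a toy expression tree (the Expr, Lit, Add and Col names are hypothetical and not part of Spark): constant sub-trees are evaluated bottom-up and replaced with literals, just as the rule above replaces foldable expressions.
// A toy expression tree: literals, integer addition and (non-foldable) column references.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Col(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Fold constants bottom-up: if both children of an Add are literals, evaluate now.
def fold(e: Expr): Expr = e match {
  case Add(l, r) => (fold(l), fold(r)) match {
    case (Lit(a), Lit(b)) => Lit(a + b)   // statically computable -> replace with a literal
    case (fl, fr)         => Add(fl, fr)  // something non-constant remains -> keep the Add
  }
  case other => other                     // literals and column references are left untouched
}

// fold(Add(Add(Lit(1), Lit(2)), Lit(3))) == Lit(6), mirroring how "1+2+3" becomes 6 in the plan
|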
2. Simplify Filters
If the filter condition always evaluates to true, the filter is removed (e.g. where 2 > 1).
If the filter condition always evaluates to false, the plan is replaced with an empty relation (e.g. where 2 < 1).
Example SQL:
select name from t1 where 2 > 1
|
Optimization process:
scala> sqlContext.sql("select name from t1 where 2 > 1")
17/07/25 15:50:25 INFO parse.ParseDriver: Parsing command: select name from t1 where 2 > 1
17/07/25 15:50:25 INFO parse.ParseDriver: Parse Completed
res23: org.apache.spark.sql.DataFrame = [name: string]

scala> res23.queryExecution
res24: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter (2 > 1)
   +- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter (2 > 1)
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [_1#0 AS name#5]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
|
After optimization, the always-true filter 2 > 1 has been removed from the logical plan.
The implementation is as follows:
object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // If the filter condition always evaluate to true, remove the filter.
    case Filter(Literal(true, BooleanType), child) => child
    // If the filter condition always evaluate to null or false,
    // replace the input with an empty relation.
    case Filter(Literal(null, _), child) => LocalRelation(child.output, data = Seq.empty)
    case Filter(Literal(false, BooleanType), child) => LocalRelation(child.output, data = Seq.empty)
  }
}
|
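As a rough, self-contained illustration (plain Scala, not Spark code; the Plan, Scan, Empty and ConstFilter names below are made up for this sketch), the same idea can be expressed on a toy plan tree: an always-true filter is dropped, and an always-false or NULL filter collapses the subtree into an empty relation.
// A toy logical plan: a table scan, an empty relation, and a constant-condition filter.
sealed trait Plan
case class Scan(table: String) extends Plan
case object Empty extends Plan
// `cond` models a constant filter condition; None stands for a NULL condition.
case class ConstFilter(cond: Option[Boolean], child: Plan) extends Plan

def simplifyFilters(p: Plan): Plan = p match {
  case ConstFilter(Some(true), child) => simplifyFilters(child) // where 2 > 1 -> drop the filter
  case ConstFilter(Some(false), _)    => Empty                  // where 2 < 1 -> empty result
  case ConstFilter(None, _)           => Empty                  // where NULL  -> empty result
  case other                          => other
}

// simplifyFilters(ConstFilter(Some(true), Scan("t1"))) == Scan("t1")
|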
3. Simplify Casts
If an expression already has the data type it is being cast to, the cast is removed.
Example SQL:
select cast(name as String) from t1
|
Optimization process:
// name is already a string column
scala> sqlContext.sql("select cast(name as String) from t1")
17/07/25 16:59:44 INFO parse.ParseDriver: Parsing command: select cast(name as String) from t1
17/07/25 16:59:44 INFO parse.ParseDriver: Parse Completed
res29: org.apache.spark.sql.DataFrame = [name: string]

scala> res29.queryExecution
res30: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(cast('name as string))]
+- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
name: string
Project [cast(name#5 as string) AS name#20]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
// the redundant cast has been removed
Project [_1#0 AS name#20]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [_1#0 AS name#20]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
|
Since name is already of type string, the optimizer removed the cast-to-string expression.
The implementation is as follows:
object SimplifyCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(e, dataType) if e.dataType == dataType => e
  }
}
|
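A minimal sketch of the same idea on a toy typed expression tree (the DType, TExpr, ColumnRef and CastExpr names are hypothetical, not Catalyst classes): a cast whose target type equals the child's type is simply dropped.
// Toy data types and a tiny typed expression tree.
sealed trait DType
case object TString extends DType
case object TInt extends DType

sealed trait TExpr { def dataType: DType }
case class ColumnRef(name: String, dataType: DType) extends TExpr
case class CastExpr(child: TExpr, dataType: DType) extends TExpr

def simplifyCasts(e: TExpr): TExpr = e match {
  // cast(name as string) where name is already a string -> just name
  case CastExpr(child, target) if child.dataType == target => simplifyCasts(child)
  case CastExpr(child, target)                              => CastExpr(simplifyCasts(child), target)
  case other                                                => other
}

// simplifyCasts(CastExpr(ColumnRef("name", TString), TString)) == ColumnRef("name", TString)
|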
4. Simplify Case Conversion Expressions
For nested case-conversion expressions, only the outermost conversion matters, so the inner conversions are removed.
Example SQL:
select upper(lower(name)) from t1
|
Optimization process:
scala> sqlContext.sql("select upper(lower(name)) from t1")
17/07/25 17:13:01 INFO parse.ParseDriver: Parsing command: select upper(lower(name)) from t1
17/07/25 17:13:01 INFO parse.ParseDriver: Parse Completed
res34: org.apache.spark.sql.DataFrame = [_c0: string]

scala> res34.queryExecution
res35: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('upper('lower('name)))]
+- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
_c0: string
Project [upper(lower(name#5)) AS _c0#22]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
// only the outermost upper() remains
Project [upper(_1#0) AS _c0#22]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [upper(_1#0) AS _c0#22]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
|
After optimization, only the outermost case-conversion call remains, which is equivalent to executing: select upper(name) from t1
The implementation is as follows:
object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // Keep only the outermost conversion and drop the inner one
      case Upper(Upper(child)) => Upper(child)
      case Upper(Lower(child)) => Upper(child)
      case Lower(Upper(child)) => Lower(child)
      case Lower(Lower(child)) => Lower(child)
    }
  }
}
|
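A self-contained sketch of the same rewrite on a toy string-expression tree (hypothetical StrExpr/UpperE/LowerE names, not Spark classes): the four patterns above are applied bottom-up, so whenever a conversion wraps another conversion only the outer one survives.
// A toy string expression tree with upper/lower conversions.
sealed trait StrExpr
case class StrCol(name: String) extends StrExpr
case class UpperE(child: StrExpr) extends StrExpr
case class LowerE(child: StrExpr) extends StrExpr

// The four rewrite patterns from the rule above.
def collapse(e: StrExpr): StrExpr = e match {
  case UpperE(UpperE(c)) => UpperE(c)
  case UpperE(LowerE(c)) => UpperE(c)
  case LowerE(UpperE(c)) => LowerE(c)
  case LowerE(LowerE(c)) => LowerE(c)
  case other             => other
}

// Apply the patterns bottom-up, like transformExpressionsUp does.
def simplify(e: StrExpr): StrExpr = e match {
  case UpperE(c) => collapse(UpperE(simplify(c)))
  case LowerE(c) => collapse(LowerE(simplify(c)))
  case other     => other
}

// simplify(UpperE(LowerE(StrCol("name")))) == UpperE(StrCol("name")),
// i.e. upper(lower(name)) behaves like upper(name).
|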
5. Optimize In
Rewrites an In predicate over a list of literals into an InSet backed by a hash set.
Example SQL:
select * from t1 where id in (1,1,2,2,1,2,1,2,2,2,2,2)
|
After optimization this is equivalent to executing the following (note: the optimization was not observed in a Spark 1.6.2 test environment!):
select * from t1 where id in (1,2)
|
The implementation is as follows:
/**
 * Replaces [[In (value, seq[Literal])]] with optimized version [[InSet (value, HashSet[Literal])]]
 * which is much faster
 */
object OptimizeIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) && list.size > 10 =>
        val hSet = list.map(e => e.eval(EmptyRow))
        InSet(v, HashSet() ++ hSet)
    }
  }
}
|
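The benefit is easier to see outside of Catalyst. The sketch below (plain Scala, not Spark code) contrasts an In-style predicate that scans a literal list for every row with an InSet-style predicate backed by a HashSet, where duplicates collapse and each lookup is an average O(1) hash probe:
import scala.collection.immutable.HashSet

// The literal list from the example query, duplicates included.
val inList: Seq[Int] = Seq(1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2)

// "In" style: every row scans the whole list (O(n) per row).
def inListPredicate(id: Int): Boolean = inList.contains(id)

// "InSet" style: build the set once; duplicates disappear and lookups become hash probes.
val inSet: HashSet[Int] = HashSet(inList: _*)   // effectively HashSet(1, 2)
def inSetPredicate(id: Int): Boolean = inSet.contains(id)

// The two predicates are equivalent; only the evaluation cost differs.
assert(Seq(0, 1, 2, 3).forall(id => inListPredicate(id) == inSetPredicate(id)))
|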
6. Simplify Like (LikeSimplification)
The following LIKE patterns are optimized so that no regular-expression matching is needed:
startsWith: 'abc%'
endsWith: '%abc'
contains: '%abc%'
equalTo: 'abc'
Example SQL:
select name from t1 where name like 'Bo%'
|
This is not executed as a regular-expression match. Optimization process:
scala> sqlContext.sql("select name from t1 where name like 'B%'")
17/07/25 18:25:04 INFO parse.ParseDriver: Parsing command: select name from t1 where name like 'B%'
17/07/25 18:25:04 INFO parse.ParseDriver: Parse Completed
res46: org.apache.spark.sql.DataFrame = [name: string]

scala> res46.queryExecution
res47: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter 'name LIKE B%
   +- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter name#5 LIKE B%
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
+- Filter StartsWith(_1#0, B)   // rewritten as a string startsWith()
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [_1#0 AS name#5]
+- Filter StartsWith(_1#0, B)
   +- Scan ExistingRDD[_1#0,_2#1,_3#2,_...
|
After optimization, the original LIKE pattern is executed as a string startsWith() operation instead of a regular-expression match.
The implementation is as follows:
/**
 * Simplifies LIKE expressions that do not need full regular-expression matching.
 */
object LikeSimplification extends Rule[LogicalPlan] {
  // if guards below protect from escapes on trailing %.
  // Cases like "something\%" are not optimized, but this does not affect correctness.
  private val startsWith = "([^_%]+)%".r  // 'abc%'
  private val endsWith   = "%([^_%]+)".r  // '%abc'
  private val contains   = "%([^_%]+)%".r // '%abc%'
  private val equalTo    = "([^_%]*)".r   // 'abc'

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Like(l, Literal(utf, StringType)) =>
      utf.toString match {
        case startsWith(pattern) if !pattern.endsWith("\\") =>
          StartsWith(l, Literal(pattern))   // string startsWith()
        case endsWith(pattern) =>
          EndsWith(l, Literal(pattern))     // string endsWith()
        case contains(pattern) if !pattern.endsWith("\\") =>
          Contains(l, Literal(pattern))     // byte-level contains check
        case equalTo(pattern) =>
          EqualTo(l, Literal(pattern))      // plain string equality
        case _ =>
          Like(l, Literal.create(utf, StringType))
      }
  }
}
|
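The same pattern classification can be sketched with plain Scala regexes and string predicates. This is an illustration only; the fallback LIKE-to-regex translation at the end is a naive assumption for the sketch, not Spark's actual escaping logic.
// The same pattern shapes the rule recognizes.
val startsWithPat = "([^_%]+)%".r   // 'abc%'
val endsWithPat   = "%([^_%]+)".r   // '%abc'
val containsPat   = "%([^_%]+)%".r  // '%abc%'
val equalToPat    = "([^_%]*)".r    // 'abc'

def likePredicate(pattern: String): String => Boolean = pattern match {
  case startsWithPat(p) => (s: String) => s.startsWith(p)   // like 'B%'   -> startsWith
  case endsWithPat(p)   => (s: String) => s.endsWith(p)     // like '%son' -> endsWith
  case containsPat(p)   => (s: String) => s.contains(p)     // like '%ob%' -> contains
  case equalToPat(p)    => (s: String) => s == p            // like 'Bob'  -> equality
  case _                =>
    // Naive fallback for illustration only: translate LIKE wildcards into a regex.
    (s: String) => s.matches(pattern.replace("%", ".*").replace("_", "."))
}

// likePredicate("B%")("Bob") == true; likePredicate("B%")("Alice") == false
|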
7. Null Propagation
In certain cases, expressions involving NULL are replaced with literals, which stops NULL values from propagating through the expression tree.
Example SQL:
select count(null) from t1
|
Optimization process:
scala> sqlContext.sql("select count(null) from t1")
17/07/26 11:40:18 INFO parse.ParseDriver: Parsing command: select count(null) from t1
17/07/26 11:40:18 INFO parse.ParseDriver: Parse Completed
res8: org.apache.spark.sql.DataFrame = [_c0: bigint]

scala> res8.queryExecution
res10: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('count(null))]
+- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
_c0: bigint
Aggregate [(count(null),mode=Complete,isDistinct=false) AS _c0#10L]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
// returns the constant 0 directly
Aggregate [0 AS _c0#10L]
+- Project
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
|
After optimization, count(null) in the logical plan is rewritten to return the constant 0 directly, so the result no longer depends on scanning the table.
The implementation is as follows:
object NullPropagation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // count(null) is rewritten to the literal 0, cast to the count's data type
      case e @ Count(Literal(null, _)) => Cast(Literal(0L), e.dataType)
      case e @ AggregateExpression(Count(exprs), _, _) if !exprs.exists(nonNullLiteral) =>
        Cast(Literal(0L), e.dataType)
      // IsNull / IsNotNull on a non-nullable expression fold to constants
      case e @ IsNull(c) if !c.nullable => Literal.create(false, BooleanType)
      case e @ IsNotNull(c) if !c.nullable => Literal.create(true, BooleanType)
      case ...
    }
|
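A minimal sketch of the null-checking part of the rule on a toy expression tree (hypothetical NExpr/Attr/IsNullE names, not Catalyst classes): when the child expression can never be null, IsNull and IsNotNull fold directly into boolean literals.
// A toy expression tree that tracks nullability.
sealed trait NExpr { def nullable: Boolean }
case class Attr(name: String, nullable: Boolean) extends NExpr
case class BoolLit(value: Boolean) extends NExpr { val nullable = false }
case class IsNullE(child: NExpr) extends NExpr { val nullable = false }
case class IsNotNullE(child: NExpr) extends NExpr { val nullable = false }

def propagateNulls(e: NExpr): NExpr = e match {
  case IsNullE(c) if !c.nullable    => BoolLit(false) // a non-nullable value is never null
  case IsNotNullE(c) if !c.nullable => BoolLit(true)  // ...and is always not null
  case other                        => other
}

// propagateNulls(IsNullE(Attr("id", nullable = false))) == BoolLit(false)
|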
8. Boolean Simplification
If a boolean expression is built from logical connectives (and, or, not), it is simplified using their algebraic properties (e.g. true && a > 1 simplifies to a > 1, and true || a > 1 simplifies to true).
Example SQL:
select name from t1 where 2 > 1 and time > 1
|
Optimization process:
scala> sqlContext.sql("select name from t1 where 2 > 1 and time > 1")
17/07/26 12:10:17 INFO parse.ParseDriver: Parsing command: select name from t1 where 2 > 1 and time > 1
17/07/26 12:10:17 INFO parse.ParseDriver: Parse Completed
res26: org.apache.spark.sql.DataFrame = [name: string]

scala> res26.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter ((2 > 1) && ('time > 1))
   +- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter ((2 > 1) && (time#9 > 1))
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
// 2 > 1 is always true, so it is dropped from the && condition
+- Filter (_5#4 > 1)
   +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
|
After optimization, the always-true expression 2 > 1 has been removed from the and condition in the logical plan.
The implementation is as follows:
/**
 * Simplifies boolean expressions:
 * 1. Simplifies expressions whose answer can be determined without evaluating both sides.
 * 2. Eliminates / extracts common factors.
 * 3. Merge same expressions
 * 4. Removes `Not` operator.
 */
object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // And: an always-true side can be dropped; an always-false side makes the whole expression false
      case and @ And(left, right) => (left, right) match {
        // true && r => r
        case (Literal(true, BooleanType), r) => r
        // l && true => l
        case (l, Literal(true, BooleanType)) => l
        // false && r => false
        case (Literal(false, BooleanType), _) => Literal(false)
        // l && false => false
        case (_, Literal(false, BooleanType)) => Literal(false)
        // a && a => a
        case (l, r) if l fastEquals r => l
        // a && (not(a) || b) => a && b
        case (l, Or(l1, r)) if (Not(l) == l1) => And(l, r)
        case (l, Or(r, l1)) if (Not(l) == l1) => And(l, r)
        case (Or(l, l1), r) if (l1 == Not(r)) => And(l, r)
        case (Or(l1, l), r) if (l1 == Not(r)) => And(l, r)
        // (a || b) && (a || c) => a || (b && c)
        case ...
      } // end of And(left, right)

      // Or: simplify using short-circuit rules
      case or @ Or(left, right) => (left, right) match {
        // true || r => true (one true side makes the whole expression true)
        case (Literal(true, BooleanType), _) => Literal(true)
        // r || true => true
        case (_, Literal(true, BooleanType)) => Literal(true)
        // false || r => r
        case (Literal(false, BooleanType), r) => r
        // l || false => l
        case (l, Literal(false, BooleanType)) => l
        // a || a => a
        case (l, r) if l fastEquals r => l
        // (a && b) || (a && c) => a && (b || c)
        case ...
      } // end of Or(left, right)

      // Eliminate Not by rewriting to the negated form of the child expression
      case not @ Not(exp) => exp match {
        // not(true) => false
        case Literal(true, BooleanType) => Literal(false)
        // not(false) => true
        case Literal(false, BooleanType) => Literal(true)
        // not(l > r) => l <= r
        case GreaterThan(l, r) => LessThanOrEqual(l, r)
        // not(l >= r) => l < r
        case GreaterThanOrEqual(l, r) => LessThan(l, r)
        // not(l < r) => l >= r
        case LessThan(l, r) => GreaterThanOrEqual(l, r)
        // not(l <= r) => l > r
        case LessThanOrEqual(l, r) => GreaterThan(l, r)
        // not(l || r) => not(l) && not(r)
        case Or(l, r) => And(Not(l), Not(r))
        // not(l && r) => not(l) || not(r)
        case And(l, r) => Or(Not(l), Not(r))
        // not(not(e)) => e
        case Not(e) => e
        case _ => not
      } // end of Not(exp)

      // if (true) a else b => a
      // if (false) a else b => b
      case e @ If(Literal(v, _), trueValue, falseValue) =>
        if (v == true) trueValue else falseValue
    }
  }
}
|
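To close, here is a compact, self-contained sketch of the and/or/not simplifications on a toy boolean AST (hypothetical BExpr/BLit/Pred names, not Spark classes); it reproduces the behaviour seen above, where true && (time > 1) collapses to time > 1.
// A toy boolean AST: literals, opaque predicates (e.g. "time > 1") and logical connectives.
sealed trait BExpr
case class BLit(value: Boolean) extends BExpr
case class Pred(sql: String) extends BExpr
case class BAnd(left: BExpr, right: BExpr) extends BExpr
case class BOr(left: BExpr, right: BExpr) extends BExpr
case class BNot(child: BExpr) extends BExpr

def simplify(e: BExpr): BExpr = e match {
  case BAnd(l, r) => (simplify(l), simplify(r)) match {
    case (BLit(true), x)  => x            // true && x  => x
    case (x, BLit(true))  => x            // x && true  => x
    case (BLit(false), _) => BLit(false)  // false && _ => false
    case (_, BLit(false)) => BLit(false)  // _ && false => false
    case (x, y)           => BAnd(x, y)
  }
  case BOr(l, r) => (simplify(l), simplify(r)) match {
    case (BLit(true), _)  => BLit(true)   // true || _  => true
    case (_, BLit(true))  => BLit(true)   // _ || true  => true
    case (BLit(false), x) => x            // false || x => x
    case (x, BLit(false)) => x            // x || false => x
    case (x, y)           => BOr(x, y)
  }
  case BNot(c) => simplify(c) match {
    case BLit(b) => BLit(!b)              // not(true) => false, not(false) => true
    case BNot(x) => x                     // not(not(x)) => x
    case x       => BNot(x)
  }
  case other => other
}

// simplify(BAnd(BLit(true), Pred("time > 1"))) == Pred("time > 1"), matching the optimized plan above
|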