The "Mysterious" Spark SQL Execution Plan


As a data warehouse development engineer, besides mastering modeling processes, model optimization, task layering, and similar knowledge, SQL development is an important part of daily work. Whether a SQL statement is written well directly affects resource utilization and cluster performance. So how do we tell whether our SQL has performance problems?

In Spark SQL, and in fact in every SQL execution engine, an execution plan is provided so that we can see intuitively how our SQL is executed and whether it runs the way we expect, and then optimize it in a targeted way.

However, for people who are not familiar with execution plans, especially those who have not used Spark Core in depth, an execution plan can seem particularly hard to understand.

In this post, I will detail each point in the execution plan.

Environment

The execution plan discussed below is based on the following SQL:
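(The original post showed the query as a screenshot. Here is a minimal sketch of what it roughly looks like, assuming a fact table named orders partitioned by pdate and a dimension table named users joined on user_id; the table and column names are illustrative, not the ones from the original.)

  select *
  from (
      select user_id,
             count(order_id) as order_cnt
      from orders
      where pdate >= '2022-01-01' and pdate <= '2022-12-31'
      group by user_id
  ) o
  join users u
    on o.user_id = u.user_id;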

This SQL gets the number of orders placed by each user in 2022 and joins the result with another table to get more detailed information. (Note ⚠️: select * is used here only to fetch all the fields; in real production code this is not allowed, and every field name must be written out.) It looks like a simple statement at first glance, but do you know what its execution plan looks like? I will go through it piece by piece below.

Execution plan

We can get the corresponding physical execution plan by calling explain formatted:

explain can be followed by one of three keywords: formatted, cost, or codegen. formatted returns a more intuitive, nicely formatted result; cost lets us see statistics such as the amount of data to be processed at each step; and codegen shows what the generated code that will be executed looks like.
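As a quick sketch (the table and column names follow the hypothetical example above), each mode is invoked by putting the keyword right after explain:

  -- formatted: readable operator tree plus a details section
  explain formatted
  select user_id, count(order_id) as order_cnt
  from orders
  where pdate >= '2022-01-01' and pdate <= '2022-12-31'
  group by user_id;

  -- replacing formatted with cost shows plan statistics where available,
  -- and codegen dumps the generated Java code for the same query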

We can also view a graphical representation of the physical execution plan of SQL through the Spark UI:

Let's take a look at what each term stands for.

CollapseCodegenStages

From the Spark UI graph above, we can see that the operators are grouped into blue boxes. Each blue box represents a codegen stage. If you know Spark, you will recognize that this corresponds to a Stage: Spark divides stages according to whether a shuffle is required.

CollapseCodegenStages merges some operators together, which reduces the depth of the function call stack and thereby improves resource utilization. However, not every operator can be merged; Exchange (that is, shuffle), for example, is not supported. From the codegen id part of the physical plan output we can tell whether an operator supports merging: if it does, codegen id is followed by the id of the corresponding stage.

So what does each of these operators mean?

Scan parquet

This operator represents reading data from Parquet. From this part of the output we can see which columns will be read. Because Parquet is a columnar format, only the columns that are actually needed have to be read, which greatly improves performance. In this part there are two important keywords: PartitionFilters and PushedFilters. PartitionFilters tells us which partitions are filtered out, while PushedFilters tells us which predicates are pushed down to Parquet, i.e. the commonly mentioned predicate pushdown. With pushdown, unneeded data is filtered out already while the data is being read, so less data is transferred from Parquet to Spark and performance improves.

If a column is already sorted, pushdown becomes especially powerful. The Parquet footer contains statistics such as min/max, so Parquet can filter out data based on these statistics alone without searching inside each Block. If the data is sorted, only the relevant Blocks need to be read; if it is not, in the worst case every Block has to be read.
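A hedged illustration (the table and column names are hypothetical): for a table partitioned by pdate, a predicate on the partition column shows up under PartitionFilters, while a predicate on an ordinary column such as event_id shows up under PushedFilters of the Scan parquet node:

  explain formatted
  select user_id, event_id
  from orders
  where pdate = '2022-01-01'   -- partition column  -> PartitionFilters
    and event_id = 559;        -- ordinary column   -> PushedFilters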

Filter

This operator does what its name says: it filters data. But it generally does not correspond one-to-one to the where condition in the SQL, because the Catalyst optimizer applies several rules (a small example follows the list):

• PushDownPredicates: Catalyst pushes filter conditions down as far as possible. Some conditions cannot be pushed, though: only predicates that can be determined at compile time can be pushed down, while conditions involving expressions that can only be evaluated at execution time, such as first, cannot.

• CombineFilters: merges two filters that can be combined into one.

• InferFiltersFromConstraints: this rule creates new filters. For example, for the Join in the SQL above, it creates a filter requiring user_id to be not null.

• PruneFilters: removes redundant filters.
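A minimal sketch of how these rules show up (hypothetical table names again): the amount predicate below is written after the join, but PushDownPredicates moves it below the join into the Filter / PushedFilters of the orders scan, and InferFiltersFromConstraints adds isnotnull(user_id) filters on both join inputs:

  explain formatted
  select u.user_name, o.order_id
  from orders o
  join users u
    on o.user_id = u.user_id
  where o.amount > 100;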

Project

This operator indicates which columns are selected. Before the logical plan is turned into a physical plan, some optimizations are applied here as well (a small example follows the list):

• ColumnPruning: removes columns that do not need to be read.

• CollapseProject: merges two adjacent Projects into one.

• PushProjectionThroughUnion: pushes the Project down through a union.
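A minimal sketch (hypothetical names): the inner and outer selects below are collapsed into a single Project by CollapseProject, and thanks to ColumnPruning only user_id and amount are actually read from the files, while order_id is pruned away:

  explain formatted
  select user_id, amount * 2 as doubled_amount
  from (
      select user_id, amount, order_id
      from orders
  ) t;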

Exchange

This is the shuffle. This part of the plan also shows which Partitioner is used:

In our example plan, a HashPartitioner is used to split the data into 200 partitions by hashing user_id, so all rows with the same user_id end up in the same partition. Besides HashPartitioner, there are several other Partitioners (a small example follows the list):

• RoundRobinPartitioning: distributes rows randomly and evenly. Anyone who knows about load balancing will find it easy to understand.

• SinglePartition: there is only one partition, and all the data ends up in it. If a window function is used in a way that puts the whole DataFrame in a single window, SinglePartition is used.

• RangePartitioning: used for sorting data. Global sorting in Spark SQL relies on it.
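A rough sketch of what typically triggers each one (hypothetical table names; 200 is just the default value of spark.sql.shuffle.partitions):

  -- produces Exchange hashpartitioning(user_id, 200): group by shuffles by key
  explain formatted
  select user_id, count(1) from orders group by user_id;

  -- produces Exchange rangepartitioning: a global order by
  explain formatted
  select * from orders order by amount;

  -- produces Exchange SinglePartition: a window function without partition by
  explain formatted
  select order_id, row_number() over (order by amount) as rn from orders;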

HashAggregate

This indicates data aggregation. There are usually two HashAggregate operators with an Exchange in between: aggregation is first done partially at the Task level and then globally, which improves performance. Operators such as reduceByKey and aggregateByKey in Spark follow the same idea.
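For a group-by count like the one in our example SQL, the relevant part of the plan looks roughly like this (a simplified sketch; the exact expression ids, partition count, and codegen ids depend on the data and configuration):

  HashAggregate(keys=[user_id], functions=[count(order_id)])
  +- Exchange hashpartitioning(user_id, 200)
     +- HashAggregate(keys=[user_id], functions=[partial_count(order_id)])
        +- ... Scan parquet of the orders table ...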

BroadcastHashJoin & BroadcastExchange

This means the join is executed as a Broadcast Join. A Broadcast Join is suitable when one table is very small and the other is large. There are other join implementations as well, such as SortMergeJoin and ShuffledHashJoin.

A BroadcastHashJoin always appears together with a BroadcastExchange, which is the step that broadcasts the small table.
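A hedged sketch (hypothetical table names): the broadcast can also be requested explicitly with a hint, and the resulting plan then contains a BroadcastExchange feeding a BroadcastHashJoin. Without a hint, Spark decides automatically based on spark.sql.autoBroadcastJoinThreshold:

  explain formatted
  select /*+ broadcast(u) */ o.order_id, u.user_name
  from orders o
  join users u
    on o.user_id = u.user_id;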

Food for thought

Let's look at the following SQL:
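(The original query was shown as a screenshot; judging from the plan below, it was roughly the following, with the in list truncated here just as it is in the plan output:)

  select count(1)
  from wrk.test_funnel_log_parquet_properties_20220401_20220410
  where pdate >= '2022-04-01'
    and pdate <= '2022-04-04'
    and event_id in (559 /* , ... truncated */);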

It is very simple: first filter the data by partition, then do a count. Let's look at the execution plan it generates:

*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(1) Project
         +- *(1) FileScan parquet wrk.test_funnel_log_parquet_properties_20220401_20220410[event_id#56,pdate#57] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ddmcNS/user/hive/warehouse/wrk.db/test_funnel_log_parquet_properti..., PartitionCount: 44, PartitionFilters: [isnotnull(pdate#57), (pdate#57 >= 2022-04-01), (pdate#57 <= 2022-04-04), event_id#56 INSET (559,..., PushedFilters: [], ReadSchema: struct<>

It is easy to understand: the data is first filtered by the partition columns, then the Parquet files are read, each Task does a local partial count, and after one shuffle a global count is computed.

Anyone familiar with the Parquet format knows that each RowGroup in a Parquet file records its row count in the metadata:

So why doesn't Spark simply read num_rows while reading the Parquet file, instead of scanning the whole file and counting the rows itself? I have not found the answer to this question yet; I think there is room for optimization here.

Summary

Today we went over Spark SQL's physical execution plan, the effect of some specific operators, a concrete example, and a few thoughts of my own. I hope it helps, and I hope that by reading physical execution plans you can tune your SQL to be highly efficient and forever free of bugs!


