1. What is Spark
Spark is a fast, general-purpose, scalable, in-memory big data analytics computing engine.
Spark Core: implements Spark's basic functions, including modules such as task scheduling, memory management, error recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD for short).
Spark SQL: a package used by Spark to manipulate structured data. Through Spark SQL, we can query data using SQL or the Apache Hive dialect HQL. Spark SQL supports multiple data sources, such as Hive tables, Parquet, and JSON.
2. Spark Shuffle analysis
2.1 HashShuffle
- Unoptimized HashShuffleManager
- Optimized HashShuffleManager
2.2 SortShuffle
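For context when tuning: since Spark 2.0, the sort-based shuffle is the only built-in implementation, and the behaviour of the optimized hash shuffle survives as SortShuffle's bypass merge path. A minimal sketch of the relevant setting (the app name is made up):

import org.apache.spark.sql.SparkSession

// Since Spark 2.0, sort-based shuffle is the only built-in manager;
// the unoptimized/optimized HashShuffleManager only exists in Spark 1.x.
val spark = SparkSession.builder()
  .appName("shuffle-demo") // hypothetical app name
  .master("local[*]")
  // With at most this many reduce partitions and no map-side aggregation,
  // SortShuffle takes the bypass merge path, which writes per-reducer files
  // much like the optimized hash shuffle did (default: 200).
  .config("spark.shuffle.sort.bypassMergeThreshold", "200")
  .getOrCreate()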
3. Execution plan processing flow
Let's first look at the process of transforming a SQL statement into an RDD:
The core execution process consists of 5 steps:
These operations and plans are handled automatically by Spark SQL, and the following plans are generated:
➢ Unresolved logical plan (Parsed Logical Plan): the Parser component checks the SQL for syntax problems, such as a missing comma or a missing FROM, and then generates the Unresolved (unresolved) logical plan; it checks neither table names nor column names.
➢ Resolved logical plan (Analyzed Logical Plan): the Analyzer validates semantics, column names, types, table names, etc. by accessing the Catalog (Spark's metadata repository); that is, it verifies whether the table names and column names actually exist.
➢ Optimized logical plan (Optimized Logical Plan): the Catalyst optimizer optimizes the plan according to various rules, such as predicate pushdown.
➢ Physical plan (Physical Plan):
1) HashAggregate operators represent data aggregation and usually appear in pairs: the first HashAggregate aggregates the local data of each execution node, and the second further aggregates the data from each partition.
2) The Exchange operator is in fact the shuffle, indicating that data needs to be moved across the cluster. Very often a pair of HashAggregate operators is separated by an Exchange.
3) The Project operator is the projection operation in SQL, i.e. selecting columns (for example: select name, age ...).
4) The BroadcastHashJoin operator indicates that the join is performed as a broadcast-based HashJoin.
5) The LocalTableScan operator is a full scan of a local table.
➢ Cost-based (CBO) selection: select the optimal execution plan.
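A small sketch of how to inspect these plans yourself: calling explain(true) on a DataFrame (or running EXPLAIN EXTENDED in SQL) prints the Parsed, Analyzed and Optimized Logical Plans and the chosen Physical Plan in that order. The table t and its columns are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A made-up table just to have something to query.
Seq((1, "alice", 20), (2, "bob", 30)).toDF("id", "name", "age")
  .createOrReplaceTempView("t")

// Prints Parsed Logical Plan -> Analyzed Logical Plan ->
// Optimized Logical Plan -> Physical Plan.
spark.sql("SELECT name, age FROM t WHERE age > 21").explain(true)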
4. SparkSQL syntax optimization
4.1 Large and small table join
If the small table is small enough to be cached in memory, a Broadcast Hash Join can be used. The principle is to first collect the small table to the driver end and then broadcast it to every partition of the large table; when the join is performed, each partition of the large table is joined locally with the small table, thereby avoiding the shuffle.
1) Automatic broadcast join is specified by a parameter: the threshold is controlled by spark.sql.autoBroadcastJoinThreshold, whose default value is 10MB.
2) Forced broadcast join syntax:
SELECT /*+ broadcast(a) */ a.*, b.* FROM a JOIN b ON a.key = b.key
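The same forced broadcast can be expressed in the DataFrame API. A minimal sketch, assuming a large and a small DataFrame with a common column key (all names here are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Made-up data standing in for a large and a small table.
val largeDf = (1 to 1000000).toDF("key")
val smallDf = Seq((1, "x"), (2, "y")).toDF("key", "value")

// broadcast() forces a Broadcast Hash Join regardless of the
// spark.sql.autoBroadcastJoinThreshold setting.
val joined = largeDf.join(broadcast(smallDf), "key")
joined.explain() // the physical plan shows BroadcastHashJoin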
4.2 Large table and large table join
SMB JOIN is a sort merge bucket operation, which requires the data to be bucketed: rows are assigned to buckets according to the hash of the key, so rows with the same key land in the same bucket, are sorted within each bucket, and are then merged by key. The purpose of bucketing is essentially to turn a large table into small tables. Once rows with the same key sit in the same bucket, the join is performed bucket by bucket, which greatly reduces the scanning of irrelevant data. Conditions of use are listed below, followed by a short sketch.
Conditions of use:
(1) Both tables are bucketed, and the number of buckets must be equal;
(2) When the two sides are joined, join column = sort column = bucket column.
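A minimal sketch of preparing two tables for an SMB join; the table names, column names, and bucket count are invented for illustration, and bucketed writes go through saveAsTable:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val big1 = (1 to 1000000).map(i => (i, s"a$i")).toDF("key", "v1")
val big2 = (1 to 1000000).map(i => (i, s"b$i")).toDF("key", "v2")

// Same bucket count on both sides; bucket column = sort column = join column.
big1.write.bucketBy(16, "key").sortBy("key").saveAsTable("big1_bucketed")
big2.write.bucketBy(16, "key").sortBy("key").saveAsTable("big2_bucketed")

// The plan should show a SortMergeJoin without an Exchange on either side.
spark.table("big1_bucketed").join(spark.table("big2_bucketed"), "key").explain()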
5. RBO-based optimization
5.1 Predicate Pushdown
Execute the predicate logic of filter conditions as early as possible to reduce the amount of data processed downstream. Predicate pushdown can greatly reduce the amount of data scanned and the disk I/O overhead.
5.2 Column Pruning
Column pruning means reading only the fields relevant to the query when scanning the data source.
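A hedged sketch showing both rules at work on a Parquet source (the path and column names are hypothetical). In the printed physical plan, PushedFilters carries the pushed-down predicate and ReadSchema contains only the pruned column set:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Parquet data set with more columns than the query needs.
val people = spark.read.parquet("/tmp/people.parquet")

// 5.1 Predicate pushdown: the age filter appears as PushedFilters in the scan.
// 5.2 Column pruning: ReadSchema lists only name and age.
people.filter($"age" > 21).select("name", "age").explain()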
5.3 Constant Folding
If we mix constant expressions into a select statement, Catalyst automatically replaces them with the result of the expression.
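A one-line check, assuming the SparkSession spark and the temp view t from the sketch in section 3: in the printed Optimized Logical Plan the expression 1 + 2 has already been folded into the literal 3:

// Catalyst folds 1 + 2 into 3 before execution; compare the Analyzed
// and Optimized Logical Plans in the explain(true) output.
spark.sql("SELECT name, age, 1 + 2 AS three FROM t").explain(true)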
6. CBO-based optimization
The RBO optimization described above belongs to the logical plan level: it considers only the query itself, not the characteristics of the data. The following introduces how CBO uses the characteristics of the data itself to optimize the physical execution plan.
CBO optimization works mainly at the physical plan level. The principle is to compute the cost of every possible physical plan and select the physical execution plan with the lowest cost. The characteristics of the data itself (such as size and distribution), the characteristics of the operators (the distribution and size of intermediate result sets), and their costs are fully considered, so as to choose the physical execution plan with the lowest execution cost.
6.1 Official experiments
Before CBO optimization:

After CBO optimization:
The physical execution plan is a tree structure, and its cost equals the sum of the costs of all execution nodes, as shown in the figure below.
The cost of each execution node is divided into two parts:
- the impact of the execution node on the data set, i.e. the size and distribution of the node's output data set;
- the cost of executing the node's operator.
The cost of each operator is relatively fixed and can be described by rules. The size and distribution of an execution node's output data set fall into two cases:
- for an initial data set, i.e. an original table, the size and distribution can be obtained directly from statistics;
- for an intermediate node, the size and distribution of its output data set can be estimated from the information of its input data set and the characteristics of the operation itself.
Therefore, in the end there are mainly two problems to solve:
- how to obtain the statistics of the original data sets;
- how to estimate the output data set of a specific operator from its input data set.
6.2 How CBO optimizes
1 Statistics collection (relevant information collected in advance)
Specific SQL statements need to be executed first to collect the required table and column statistics.
➢ Generate table-level statistics (scans the table):
ANALYZE TABLE table_name COMPUTE STATISTICS
This generates sizeInBytes (the size of the table) and rowCount (the number of rows in the table).
In the following example, the Statistics row shows that the total size of the customer table is 37026233 bytes, i.e. 35.3 MB, and the total number of records is 280,000.
spark-sql> ANALYZE TABLE customer COMPUTE STATISTICS;
Time taken: 12.888 seconds
spark-sql> desc extended customer;
c_customer_sk bigint NULL
c_customer_id string NULL
c_current_cdemo_sk bigint NULL
c_current_hdemo_sk bigint NULL
c_current_addr_sk bigint NULL
c_first_shipto_date_sk bigint NULL
c_first_sales_date_sk bigint NULL
c_salutation string NULL
c_first_name string NULL
c_last_name string NULL
c_preferred_cust_flag string NULL
c_birth_day int NULL
c_birth_month int NULL
c_birth_year int NULL
c_birth_country string NULL
c_login string NULL
c_email_address string NULL
c_last_review_date string NULL
# Detailed Table Information
Database jason_tpc_ds
Table customer
Owner jason
Created Time Sat Sep 15 14:00:40 CST 2018
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.3.2
Type EXTERNAL
Provider hive
Table Properties [transient_lastDdlTime=1536997324]
Statistics 37026233 bytes, 280000 rows
Location hdfs://dw/tpc_ds/customer
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties [field.delim=|, serialization.format=|]
Partition Provider Catalog
Time taken: 1.691 seconds, Fetched 36 row(s)
➢ Generate column-level statistics:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, col3
As can be seen from the following example, for the column c_customer_sk in the customer table: the minimum value is 1, the maximum value is 280000, the number of nulls is 0, the number of distinct values is 274368, the average column length is 8, and the maximum column length is 8.
spark-sql> ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_customer_id, c_current_cdemo_sk;
Time taken: 9.139 seconds
spark-sql> desc extended customer c_customer_sk;
col_name c_customer_sk
data_type bigint
comment NULL
min 1
max 280000
num_nulls 0
distinct_count 274368
avg_col_len 8
max_col_len 8
histogram NULL
2 Estimation of the operator's influence on the data set
For an intermediate operator, the statistics of its output data set can be estimated from the statistics of its input data set and the characteristics of the operator. This section takes Filter as an example to illustrate an operator's impact on a data set.
For a common Filter of the form Column A < value B, the statistics of the output intermediate result can be estimated as follows:
- if A.min > B, no data is selected and the output result is empty;
- if A.max < B, all the data is selected, the output result is the same as A, and the statistics remain unchanged;
- if A.min < B < A.max, the proportion of selected data is (B.value - A.min) / (A.max - A.min); A.min remains unchanged and A.max is updated to B.value.
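For example, with the customer statistics collected above (c_customer_sk has min 1 and max 280000), a hypothetical filter c_customer_sk < 140000 falls into the third case: the estimated proportion of selected data is (140000 - 1) / (280000 - 1) ≈ 0.5, so about 140,000 of the 280,000 rows are expected to remain, and the column's max is updated to 140000.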
3 Operator cost estimation
Common operations in SQL are Selection (represented by the select statement), Filter (represented by the where statement), and the Cartesian product (represented by the join statement). The most expensive of them is join.
Spark SQL's CBO estimates the cost of a join as:
Cost = rows * weight + size * (1 - weight)
Cost = CostCPU * weight + CostIO * (1 - weight)
Here rows, the number of records, represents the CPU cost, and size represents the IO cost. weight is determined by spark.sql.cbo.joinReorder.card.weight, which defaults to 0.7.
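As a hypothetical illustration with the default weight of 0.7: if a join is estimated to produce rows = 280000 and size = 37026233 bytes, its cost comes out as 280000 * 0.7 + 37026233 * 0.3 = 196000 + 11107869.9 ≈ 11303870; a plan with a smaller estimated cost would be preferred.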
6.3 CBO Optimizes Build Side Selection
For a Hash Join between two tables, the smaller table is generally chosen as the build side, on which the hash table is built, while the other side serves as the probe side. When CBO is not enabled, the build side is selected according to the original data size of the tables, and t2 is chosen. When CBO is enabled, the build side is selected based on the estimated cost, and in this example t1 is more suitable as the build side.
6.4 Optimizing Join Types
In Spark SQL, Join can be divided into Shuffle-based Join and BroadcastJoin. A Shuffle-based Join needs to introduce a Shuffle, so its cost is relatively high. A BroadcastJoin needs no Shuffle, but it requires at least one table to be small enough to be broadcast to every Executor through Spark's Broadcast mechanism.
When CBO is disabled, Spark SQL decides whether to use BroadcastJoin via spark.sql.autoBroadcastJoinThreshold, whose default value is 10485760, i.e. 10 MB, and the decision is based on the original sizes of the tables participating in the Join. In the example in the figure below, the size of Table 1 is 1 TB and the size of Table 2 is 20 GB; when joining the two, since both are far greater than the automatic BroadcastJoin threshold, Spark SQL chooses SortMergeJoin when CBO is not enabled. When CBO is enabled, the result set of Table 1 after Filter 1 is 500 GB, while the result set of Table 2 after Filter 2 is 10 MB, below the automatic BroadcastJoin threshold, so Spark SQL selects BroadcastJoin.
6.5 Optimizing the order of multi-table Join
When CBO is not enabled, Spark SQL performs the Joins in the order in which they appear in the SQL. In extreme cases, the entire Join may be a left-deep tree. As shown in the figure below (TPC-DS Q25), a multi-way Join shaped as a left-deep tree has the following problems:
- every subsequent Join depends on the result of the previous Join, so the Joins cannot be performed in parallel;
- the input and output data volumes of the first two Joins are very large, making those Joins expensive and long-running.
When CBO is enabled, Spark SQL optimizes the execution plan as follows:
6.6 Using CBO
CBO is enabled via "spark.sql.cbo.enabled", which defaults to false. Once the configuration is enabled, the CBO optimizer can make a series of estimates based on the statistics of tables and columns and finally select the optimal query plan, for example: build side selection, optimizing the Join type, and optimizing the order of a multi-table Join.
The following is a description of the relevant parameters:
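A minimal sketch of the CBO-related configuration keys (the keys and the defaults noted in the comments are from Spark's configuration documentation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Master switch for cost-based optimization (default: false).
  .config("spark.sql.cbo.enabled", "true")
  // Reorder multi-table joins based on estimated cost (default: false).
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  // Maximum number of join items reordered by dynamic programming (default: 12).
  .config("spark.sql.cbo.joinReorder.dp.threshold", "12")
  // Weight of the rows (CPU) term in the join cost formula (default: 0.7).
  .config("spark.sql.cbo.joinReorder.card.weight", "0.7")
  .getOrCreate()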
Summary
This article first explained the Shuffle tuning at the bottom layer of Spark and the entire processing flow from SQL through execution plan generation to RDD, then covered Spark SQL syntax optimization, and finally sorted out how Spark SQL optimizes based on RBO and CBO!