Spark performance tuning guide is here!

1. What is Spark

Spark is a fast, general-purpose and scalable memory-based big data analytics computing engine.

Spark Core: implements Spark's basic functionality, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD).
Spark SQL: a package used by Spark to work with structured data. Through Spark SQL, we can query data with SQL or the Apache Hive dialect (HQL). Spark SQL supports multiple data sources, such as Hive tables, Parquet, JSON, and so on.


2. Spark Shuffle analysis

2.1 HashShuffle

  1. Unoptimized HashShuffleManager
  2. Optimized HashShuffleManager

2.2 SortShuffle


3. Execution plan processing flow

Let's first look at how a SQL statement is transformed into an RDD.
The core execution process consists of five steps.
These operations and plans are handled automatically by Spark SQL, and the following plans are generated:
➢ Unresolved logical plan (Parsed Logical Plan):
The Parser component checks the SQL for syntax problems, such as a missing comma or a missing FROM, and then generates an Unresolved (undetermined) logical plan; it does not check table names or column names.
➢ Resolved logical plan (Analyzed Logical Plan):
The Analyzer validates semantics, column names, types and table names by accessing the Catalog (Spark's metadata repository), that is, it verifies whether the table names and column names actually exist.
➢ Optimized logical plan (Optimized Logical Plan):
The Catalyst optimizer optimizes according to various rules, such as predicate pushdown.
➢ Physical execution plan (Physical Plan):
1) The HashAggregate operator represents data aggregation. HashAggregate usually appears in pairs: the first aggregates the local data of the executing node, and the other further aggregates and computes the data of each partition.
2) The Exchange operator is the shuffle, indicating that data needs to be moved across the cluster. The two HashAggregate operators are often separated by an Exchange.
3) The Project operator is the projection operation in SQL, i.e. selecting columns (for example: select name, age ...).
4) The BroadcastHashJoin operator indicates that the HashJoin is performed in a broadcast-based manner.
5) The LocalTableScan operator is a full table scan of a local table.
➢ Cost-based (CBO) selection: choose the optimal execution plan.
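To see these plans for yourself, you can call explain on a query. A minimal sketch, assuming a spark-shell session (so spark and spark.implicits._ are available) and a made-up people view; the exact plan text depends on the Spark version:

import spark.implicits._
Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age").createOrReplaceTempView("people")

val df = spark.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name")
df.explain(true)   // prints the Parsed, Analyzed and Optimized Logical Plans and the Physical Plan
// For a GROUP BY like this, the physical plan typically shows paired HashAggregate
// operators separated by an Exchange; in Spark 3.x, df.explain("extended") prints the same information.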

4. SparkSQL syntax optimization

4.1 Large and small table join

If the small table is small enough to be cached in memory first, a Broadcast Hash Join can be used. The principle is to first collect the small table to the driver and then broadcast it to every partition of the large table. When the join is performed, the data of each partition of the large table is joined locally with the small table, which avoids the shuffle.
1) Automatic broadcast join: controlled by the parameter spark.sql.autoBroadcastJoinThreshold, whose default value is 10MB.
2) Forced broadcast join syntax:

SELECT /*+ broadcast(a) */ ... FROM a JOIN b ON ...
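The same hint can be expressed with the DataFrame API via the broadcast function. A minimal sketch, assuming a spark-shell session; the key column and the two DataFrames are made up:

import org.apache.spark.sql.functions.broadcast

val largeDF = spark.range(0, 1000000).withColumnRenamed("id", "key")
val smallDF = spark.range(0, 100).withColumnRenamed("id", "key")

// broadcast() asks Spark to broadcast smallDF regardless of spark.sql.autoBroadcastJoinThreshold
val joined = largeDF.join(broadcast(smallDF), Seq("key"))
joined.explain()   // the physical plan should show BroadcastHashJoin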

4.2 Large table and large table join

SMB Join stands for sort merge bucket join, an operation that requires bucketing. The data is first sorted and then merged by key, and rows with the same key are put into the same bucket (determined by hashing the key). The purpose of bucketing is essentially to turn a large table into small ones: once rows with the same key sit in the same bucket, the join only has to operate bucket by bucket, which greatly reduces the scanning of irrelevant data during the merge.
Conditions of use:
(1) Both tables are bucketed, and the numbers of buckets must be equal
(2) On both sides of the join: join column = sort column = bucketing column
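A sketch of how these conditions can be met (assuming a spark-shell session; the table and column names are made up, and bucketBy/sortBy only work together with saveAsTable):

import spark.implicits._

val usersDF  = Seq((1L, "alice"), (2L, "bob")).toDF("user_id", "name")
val ordersDF = Seq((1L, 9.99), (1L, 5.0), (2L, 3.5)).toDF("user_id", "amount")

// Both tables use the same bucket count (4) and the same column for bucketing and sorting
usersDF.write.bucketBy(4, "user_id").sortBy("user_id").saveAsTable("users_bucketed")
ordersDF.write.bucketBy(4, "user_id").sortBy("user_id").saveAsTable("orders_bucketed")

val joined = spark.table("orders_bucketed").join(spark.table("users_bucketed"), "user_id")
joined.explain()   // with matching buckets, the shuffle on user_id can be avoided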

5. RBO-based optimization

5.1 Predicate Pushdown

Execute the predicate logic of filter conditions as early as possible to reduce the amount of data processed downstream. Pushing down predicates can greatly reduce the amount of data scanned and the disk I/O overhead.

5.2 Column Pruning

Column pruning means reading only the fields relevant to the query when scanning the data source.

5.3 Constant Folding

If a select statement contains constant expressions, Catalyst automatically replaces them with the evaluated result of the expression.
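A small sketch that shows all three rules (5.1–5.3) at once; it assumes a spark-shell session and a hypothetical Parquet file, and the exact plan text varies between Spark versions:

import spark.implicits._

val people = spark.read.parquet("/tmp/people.parquet")   // hypothetical path
val q = people
  .select("name", "age")      // column pruning: only name and age are read from the files
  .filter($"age" > 20 + 1)    // constant folding: 20 + 1 becomes the literal 21 in the optimized plan

q.explain(true)   // the Parquet scan should list the filter under PushedFilters (predicate pushdown)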

6. CBO-based optimization

The RBO optimizations described above operate on the logical plan; they only consider the query itself and not the characteristics of the data. The following sections introduce how CBO uses the characteristics of the data itself to optimize the physical execution plan.
CBO optimization works mainly at the physical plan level. The principle is to compute the cost of every possible physical plan and select the one with the lowest cost. It takes full account of the characteristics of the data itself (such as size and distribution), the characteristics of the operators (the size and distribution of intermediate result sets) and their costs, so as to better choose the physical execution plan with the lowest execution cost.

6.1 Official experiments

Before CBO optimization:
After CBO optimization:
The physical execution plan is a tree structure, and its cost equals the sum of the costs of all execution nodes.
The cost of each execution node is divided into two parts:

  • The impact of the execution node on the data set, i.e. the size and distribution of the node's output data set
  • The cost of the operator executed by the node

The cost of each operator is relatively fixed and can be described by rules. The size and distribution of an execution node's output data set fall into two cases:
    1. The initial data set, i.e. the original table, whose size and distribution can be obtained directly from statistics;
    2. The output data set of an intermediate node, whose size and distribution can be estimated from the information of its input data set and the characteristics of the operation itself.

Therefore, there are ultimately two main problems to solve:

  1. How to get the statistics of the original dataset
  2. How to estimate the output dataset of a specific operator based on the input dataset

6.2 How CBO optimizes

1 Statistics collection (relevant information collected in advance)

Specific SQL statements need to be executed first to collect the required table and column statistics.
➢ Generate table-level statistical information (scan table):

ANALYZE TABLE table_name COMPUTE STATISTICS

This generates sizeInBytes (the size of the table) and rowCount (the number of rows in the table).
In the following example, the Statistics row shows that the total size of the customer table's data is 37026233 bytes, i.e. 35.3MB, and the total number of records is 280,000.

spark-sql> ANALYZE TABLE customer COMPUTE STATISTICS;
Time taken: 12.888 seconds
​
spark-sql> desc extended customer;
c_customer_sk bigint   NULL
c_customer_id string   NULL
c_current_cdemo_sk     bigint NULL
c_current_hdemo_sk     bigint NULL
c_current_addr_sk       bigint NULL
c_first_shipto_date_sk bigint NULL
c_first_sales_date_sk   bigint NULL
c_salutation   string   NULL
c_first_name   string   NULL
c_last_name   string   NULL
c_preferred_cust_flag   string NULL
c_birth_day   int     NULL
c_birth_month int     NULL
c_birth_year   int     NULL
c_birth_country string NULL
c_login string NULL
c_email_address string NULL
c_last_review_date     string NULL
# Detailed Table Information
Database       jason_tpc_ds
Table   customer
Owner   jason
Created Time   Sat Sep 15 14:00:40 CST 2018
Last Access   Thu Jan 01 08:00:00 CST 1970
Created By     Spark 2.3.2
Type   EXTERNAL
Provider       hive
Table Properties       [transient_lastDdlTime=1536997324]
Statistics     37026233 bytes, 280000 rows
Location       hdfs://dw/tpc_ds/customer
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat   org.apache.hadoop.mapred.TextInputFormat
OutputFormat   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties     [field.delim=|, serialization.format=|]
Partition Provider     Catalog
Time taken: 1.691 seconds, Fetched 36 row(s)
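The collected statistics can also be read programmatically from the optimized logical plan. A sketch; the stats API and whether rowCount is populated depend on the Spark version and on spark.sql.cbo.enabled:

val stats = spark.table("customer").queryExecution.optimizedPlan.stats
println(stats.sizeInBytes)   // e.g. 37026233
println(stats.rowCount)      // e.g. Some(280000) once ANALYZE TABLE has been run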

➢ Generate column level statistics

ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, col3

As can be seen from the following example, for the column c_customer_sk in the customer table: the minimum value is 1, the maximum value is 280000, the number of nulls is 0, the number of distinct values is 274368, the average column length is 8, and the maximum column length is 8.

spark-sql> ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_customer_id, c_current_cdemo_sk;
Time taken: 9.139 seconds
spark-sql> desc extended customer c_customer_sk;
col_name       c_customer_sk
data_type     bigint
comment NULL
min     1
max     280000
num_nulls     0
distinct_count 274368
avg_col_len   8
max_col_len   8
histogram     NULL

2 Estimation of the operator's influence on the data set

For intermediate operators, the statistical results of the output data set can be estimated according to the statistical information of the input data set and the characteristics of the operator.

This section takes Filter as an example to illustrate an operator's impact on the data set.
For the common filter "column A < value B", the statistics of the output intermediate result can be estimated as follows:

  • If A.min > B, no data is selected and the output result is empty
  • If A.max < B, all the data is selected, the output result is the same as A, and the statistical information remains unchanged
  • If A.min < B < A.max, the proportion of the selected data is (B.value - A.min) / (A.max - A.min), A.min remains unchanged, and A.max is updated to B.value
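A minimal sketch of this estimate; ColumnStat here is a simplified stand-in, not Spark's internal class:

case class ColumnStat(min: Double, max: Double)

// Fraction of rows expected to satisfy "A < B", following the three cases above
def filterSelectivity(a: ColumnStat, b: Double): Double =
  if (a.min > b) 0.0                       // nothing selected, output is empty
  else if (a.max < b) 1.0                  // everything selected, statistics unchanged
  else (b - a.min) / (a.max - a.min)       // partial selection; A.max is updated to b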

3 Operator cost estimation

Common operations in SQL are selection (represented by the select clause), filtering (represented by the where clause), and the Cartesian product (represented by the join clause). The most expensive of them is join.
Spark SQL's CBO estimates the cost of a join as:

Cost = rows * weight + size * (1 - weight)
Cost = CostCPU * weight + CostIO * (1 - weight)

Here rows, the number of records, represents the CPU cost, and size represents the IO cost. weight is determined by spark.sql.cbo.joinReorder.card.weight, which defaults to 0.7.
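A toy illustration of the formula; the numbers are made up, and CBO simply compares such costs across candidate plans and keeps the cheapest:

val weight = 0.7   // spark.sql.cbo.joinReorder.card.weight
def cost(rows: Long, sizeInBytes: Long): Double =
  rows * weight + sizeInBytes * (1 - weight)

val planA = cost(1000000L, 200L * 1024 * 1024)   // 1M rows, 200 MB
val planB = cost(5000000L, 50L * 1024 * 1024)    // 5M rows, 50 MB
// the plan with the smaller cost is chosen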

6.3 CBO Optimizes Build Side Selection

For a hash join of two tables, the smaller table is generally chosen as the build side, on which the hash table is built, and the other side serves as the probe side. When CBO is not enabled, the build side is selected according to the original data size of the tables, so t2 is chosen.
After CBO is enabled, the build side is selected based on the estimated cost; in this example t1 is the more suitable build side.

6.4 Optimizing Join Types

In Spark SQL, joins can be divided into shuffle-based Join and BroadcastJoin. A shuffle-based Join needs to introduce a shuffle, which is relatively expensive. A BroadcastJoin needs no shuffle, but it requires that at least one of the tables is small enough to be broadcast to every Executor through Spark's Broadcast mechanism.
When CBO is disabled, Spark SQL decides whether to use BroadcastJoin based on spark.sql.autoBroadcastJoinThreshold. Its default value is 10485760, i.e. 10 MB, and the decision is based on the original size of the tables participating in the Join.
In the following example, Table 1 is 1 TB and Table 2 is 20 GB, so when joining the two, since both are far larger than the automatic BroadcastJoin threshold, Spark SQL chooses SortMergeJoin when CBO is not enabled.
When CBO is enabled, the result set of Table 1 after Filter 1 is 500 GB, while the result set of Table 2 after Filter 2 is 10 MB, which is below the automatic BroadcastJoin threshold, so Spark SQL chooses BroadcastJoin.


6.5 Optimizing the order of multi-table Join

When CBO is not enabled, Spark SQL performs the joins in the order they appear in the SQL. In extreme cases, the entire Join may be a left-deep tree. In the TPC-DS Q25 example, the multi-way Join has the following problems.

  1. In a left-deep tree, every subsequent Join depends on the result of the previous Join, so the Joins cannot be executed in parallel.
  2. The first two Joins have very large input and output data volumes; these large Joins take a long time to execute.

When CBO is enabled, Spark SQL reorders the Joins and optimizes the execution plan accordingly.

6.6 Using CBO

CBO is enabled through spark.sql.cbo.enabled, which defaults to false. After it is enabled, the CBO optimizer can make a series of estimates based on the statistics of tables and columns and finally select the optimal query plan, for example: build side selection, Join type optimization, multi-table Join reordering, and so on.
The following is a description of the relevant parameters:
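As a sketch, a few of the commonly cited CBO-related settings (the names are from the Spark documentation; defaults can differ between Spark versions):

spark.conf.set("spark.sql.cbo.enabled", "true")                   // turn on cost-based optimization
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")       // let CBO reorder multi-table joins
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "12")    // max number of relations considered for reordering
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")  // collect histograms with ANALYZE ... FOR COLUMNS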

Summary

This article first explained the tuning of Spark's underlying Shuffle and the entire processing flow from SQL to execution plan to RDD, then covered Spark SQL syntax optimization, and finally sorted out how Spark SQL optimizes based on RBO and CBO!


Origin blog.csdn.net/u011109589/article/details/132019001